Introduction

In a world that is increasingly connected, population demographics and trends and their potential changes have significant implications on future economic and environmental planning. In particular, the continual improvement in life expectancy over the recent decades presents a serious challenge for government pension schemes, superannuation funds, and other retirement income providers, as these entities need to ensure that the funds available will be sufficient to pay off future benefits. It is hence very important to develop suitable mortality projection models for measuring the past trends and producing mortality forecasts.

There has been much development in the field of mortality modelling and projection in the last two and a half decades. Ever since Lee and Carter (1992) laid the foundation for stochastic mortality modelling, a large amount of interest from both academics and practitioners has been spurred up, from which there are hundreds of research papers and reports published to date on this area. The initial Lee–Carter model is elegant and straightforward, but it also suffers from various limitations, such as the restriction of constant age sensitivity and the inadequate allowance for potential volatility. Numerous modifications, extensions, and alternatives to the Lee–Carter model have been proposed and tested in the literature—see Booth and Tickle (2008), Cairns et al. (2008), and Cairns et al. (2009) for a comprehensive review and comparison of single-population mortality projection models.

A research area that has recently gained much attention is the co-modelling of multiple populations. Some examples of related populations include geographically and socioeconomically close countries, female and male populations of the same region, and regional compared to national mortality of the same country. It can be argued that these subdivided groups or neighbouring populations are linked by certain common driving forces, and mortality projection models should be developed to capture the underlying relationships properly. One of the main criteria of a good multi-population mortality projection model is biological consistency in forecasting female and male mortality (Li 2013). For instance, it is a well-known feature that females have a higher life expectancy than that of males. A model that forecasts the opposite situation would be difficult to justify due to incompatibility with historical data and current biological knowledge (e.g. Li et al. 2016).

The earliest applications of the Lee–Carter model have treated related populations separately or simply in aggregate. As noted in Li (2013), however, treating two populations as independent when applying the Lee–Carter model would result in divergent mortality projections or possibly a mortality crossover. Without considering related populations jointly, long-term mortality forecasts could become nonsensical and unusable. In order to produce coherent or non-divergent mortality forecasts, Li and Lee (2005) proposed an extension of the Lee–Carter model, known as the augmented common factor model. This multi-population model allows for a common population-wide factor for the main long-term trend as well as an additional factor for short-term deviations of each subpopulation from the main trend. The specific time series modelling of the additional factor ensures convergence in the projected death rate ratios at each age in the long term. Li (2013) then modified the augmented common factor model by using a Poisson assumption to cater for the number of deaths and generalising the structure to incorporate multiple additional factors for each sex. This Poisson common factor model (PCFM) maintains the projected death rate ratio’s convergence at each age, offers a formal model framework for statistical analysis, and provides more flexibility in capturing higher order effects in the data. Some examples of its applications to demographic and insurance problems can be found in Li and Haberman (2015), Li et al. (2016), Parr et al. (2016), and Yang et al. (2016).

The PCFM can be further improved in a number of aspects. Yang et al. (2016) extended the PCFM by incorporating the cohort effect via six different structures, and found that the new structures improve the model fitting, reduce the optimal number of additional factors, and maintain coherent mortality forecasts. This paper, on the other hand, seeks to modify the PCFM in a different fashion. First, the original PCFM does not allow for a different number of male- and female-specific factors. In principle, relaxing this limitation should allow the model to capture more different features or trends between both sexes, resulting in a more flexible model to cope with different situations. Second, a common age sensitivity effect can be imposed on the additional factors. It can potentially improve the parsimony of the model, in terms of reducing the number of parameters required. These two modifications work in an opposite way to some extent—one allows for differences in the period effect whereas the other exploits similar age sensitivity patterns (if any) between females and males. But the main purpose is the same, which is to develop more ways to adapt the PCFM to mortality data of different populations with diverse features.

In short, this paper aims to test the two above-mentioned extensions of the PCFM and examine whether these alternatives show an improvement in performance over the baseline model. The assessment is based on the quality of model fitting as well as forecast accuracy. Section 2 reviews the PCFM and provides details of the two proposed model variations. Section 3 applies the models to datasets of eight populations and analyses the fitting results. Section 4 compares the models in terms of forecasting performance and conducts out-of-sample testings. Section 5 further compares the forecasting performance of the models against those proposed by Hyndman et al. (2013) and Bergeron-Boucher et al. (2018). Finally, Sect. 6 sets forth our concluding remarks and suggestions on future research.

Two modifications of PCFM

The Lee and Carter (1992) model specifies the log central death rate at age x in year t as \(\ln{m}_{x,t}={a}_{x}+{b}_{x} {k}_{t}+{e}_{t}\),where \({a}_{x}\) is the age-specific average log death rate, \({k}_{t}\) is the time-varying mortality index with \({b}_{x}\) as the age-specific sensitivity, and \({e}_{t}\sim \mathrm{\rm N}(0,{\sigma }_{e}^{2})\). The estimated trend of \({k}_{t}\) is usually downward and fairly linear, and so it is often modelled as a random walk with drift. The Li and Lee (2005) model and the PCFM are its extensions to cater for multiple populations.

Under the PCFM (Li 2013), the force of mortality \({\mu }_{x,t,i}\) at age x in year t for sex i (i = 1 for females and i = 2 for males) is assumed to be constant over each integer age-period interval, so the central death rate \({m}_{x,t,i}={\mu }_{x,t,i}\). The number of deaths is modelled as a Poisson random variable, i.e. \({D}_{x,t,i}\sim {\text{Poisson}}({E}_{x,t,i} {m}_{x,t,i})\), where \({D}_{x,t,i}\) is the number of deaths and \({E}_{x,t,i}\) is the corresponding exposure to risk. While \({E}_{x,t,i}\) is a known quantity, \({m}_{x,t,i}\) is an unknown parameter that requires estimation from the data.

The Poisson assumption has quite a few advantages. As argued in Li (2013), this assumption leads to a rigorous statistical framework for analysing mortality data. Also, treating the number of deaths \({D}_{x,t,i}\) as a counting random variable is a more natural choice compared to modelling the death rate with a homoscedastic error term in earlier work such as Lee and Carter (1992) and Li and Lee (2005). This Poisson framework is widely used in the literature—see Brouhns et al. (2002) and Cairns et al. (2009) for previous applications. When assessing uncertainty in mortality changes, however, observed death counts appear to be over-dispersed for many countries (e.g. Cairns et al. 2009). For such cases, the Poisson assumption above can readily be modified as over-dispersed Poisson, in which the mean \({\rm E}\left({D}_{x,t,i}\right)={E}_{x,t,i} {m}_{x,t,i}\) remains the same while the variance is defined as \({\text{Var}}\left({D}_{x,t,i}\right)=\phi {E}_{x,t,i} {m}_{x,t,i}\) instead, with \(\phi >1\) as the dispersion parameter (Renshaw and Haberman, 2006). In general, there is no change needed for the computation algorithm and the parameter estimates would stay the same, except that the extra dispersion parameter has to be calculated separately from the deviance function.

Extended from the augmented common factor model in Li and Lee (2005), the log central death rate under the PCFM is modelled as:

$$\ln{m}_{x,t,i}={a}_{x,i}+{B}_{x} {K}_{t}+\sum \limits_{j=1}^{n}{b}_{x,i,j} {k}_{t,i,j}$$

where \({a}_{x,i}\) represents the overall age effect, \({B}_{x} {K}_{t}\) is the common factor for both sexes, and \({b}_{x,i,j} {k}_{t,i,j}\) is the jth additional sex-specific factor (a total of n factors) for sex i. In more detail, \({K}_{t}\) is the mortality index of the common factor, and \({B}_{x}\) measures the sensitivity of the log central death rate to changes in \({K}_{t}\) at each age. Similarly, \({k}_{t,i,j}\) is the time-varying component of the jth additional factor with age sensitivity measure \({b}_{x,i,j}\). Compared to Li and Lee (2005), the PCFM allows for the incorporation of multiple additional factors where necessary, resulting in improved modelling results (e.g. Li et al. 2016).

The (conditional) maximum likelihood parameter estimates of the PCFM are calculated via an iterative updating scheme (see “Appendix” for details). In order to ensure model identification, the model is subject to (4n + 2) constraints, including \({\sum }_{x}{B}_{x}=1\), \({\sum }_{t}{K}_{t}=0\), \({\sum }_{x}{b}_{x,i,j}=1\), and \({\sum }_{t}{k}_{t,i,j}=0\). To determine the optimal number of additional factors, the Bayesian Information Criterion (BIC)Footnote 1 is used as the main statistical measure to strike a balance between goodness-of-fit and over-parameterisation. Other considerations include the patterns in the residual plots, the trends of the additional parameters, and the volume of data being under investigation.

After calibration of the model, the time-varying components \({K}_{t}\) and \({k}_{t,i,j}\) need to be projected into the future. Previous studies have shown that the common mortality index \({K}_{t}\) tends to be linear and decreasing for various countries (e.g. Li et al. 2016). Hence, \({K}_{t}\) can be modelled as a random walk with drift:

$${K}_{t}=\theta +{K}_{t-1}+{\delta }_{t}$$

where \(\theta\) is the drift term and \({\delta }_{t}\) is a normally distributed random variable with mean 0 and variance \({\sigma }^{2}\). On the other hand, the time-varying terms \({k}_{t,i,j}\) are used to represent short-term deviations from the main trend for each sex, so a mean-reverting process would be preferred. Each \({k}_{t,i,j}\) is then assumed to follow a weakly stationary autoregressive AR(p) process:

$${k}_{t,i,j}={\alpha }_{0,i,j}+\sum \limits_{l=1}^{p}{\alpha }_{l,i,j} {k}_{t-l,i,j}+{\varepsilon }_{t,i,j}$$

where \({\alpha }_{0,i,j}\) and \({\alpha }_{l,i,j}\) are autoregressive parameters and \({\varepsilon }_{t,i,j}\) is a normal error term with mean 0 and variance \({\omega }_{i,j}^{2}\). The order p is chosen based on the partial autocorrelation functions (PACFs) of the time-varying components and also the autocorrelations of the residuals. Additionally, each \({k}_{t,i,j}\) is assumed to be independent of the others.Footnote 2 Under these conditions, future death rates (in year t > T) can be projected as:

$${\hat{m}}_{x,t,i}={\dot{m}}_{x,T,i}\exp\left({B}_{x}\left({\hat{K}}_{t}-{K}_{T}\right)+\sum \limits_{j=1}^{n}{b}_{x,i,j}\left({\hat{k}}_{t,i,j}-{k}_{T,i,j}\right)\right)$$

where \({\hat{K}}_{t}\) and \({\hat{k}}_{t,i,j}\) are projected values from their respective time series processes setting the error terms to zero, and the starting point of projection, \({\dot{m}}_{x,T,i}\), is taken from the latest set of data observed in year T. This arrangement helps to avoid significant bias in the beginning part of the projection period (Lee and Miller 2001). Using these projections, the projected male-to-female ratio of death rates can be expressed as:

$$\frac{{\hat{m}}_{x,t,2}}{{\hat{m}}_{x,t,1}}=\frac{{\dot{m}}_{x,T,2}}{{\dot{m}}_{x,T,1}}\exp\left(\sum \limits_{j=1}^{n}\left({b}_{x,2,j}\left({\hat{k}}_{t,2,j}-{k}_{T,2,j}\right)-{b}_{x,1,j}\left({\hat{k}}_{t,1,j}-{k}_{T,1,j}\right)\right)\right)$$

This projected ratio converges to a constant over time at each age if \({\hat{k}}_{t,1,j}\) and \({\hat{k}}_{t,2,j}\) also converge across time. This ‘coherence’ property (Li and Lee 2005) holds as long as the fitted time series processes of \({k}_{t,i,j}\) are all weakly stationary. In cases where a fitted process is not weakly stationary, an alternative model such as an AR(1) with \(\left|{\alpha }_{1,i,j}\right|<1\) or a random walk without drift may also be used arbitrarily to preserve the coherence property.

We now propose the first modified version of the PCFM with variable sex-specific factors (PCFM-VSF), which relaxes the initial assumption that the number of additional factors is the same between both sexes. This approach can be interpreted as allowing for the existence of potential factors or trends that affect only one sex or impact each sex differently. For example, in recent decades, while life expectancies of both sexes have continually improved, male life expectancy has increased faster than female life expectancy. Li (2013) demonstrated that for Australian data, the observed difference in life expectancy at birth between both sexes was approximately 7 years in 1968, reduced to 4.5 years in 2007, and the gap can be projected to narrow to 3.1 years in 2050. There seem to be some underlying factors that drive male mortality to improve faster than female mortality. Utilising the same notation as above, the log central death rate is then modelled as:

$$\ln{m}_{x,t,i}={a}_{x,i}+{B}_{x} {K}_{t}+\sum \limits_{j=1}^{{n}_{i}}{b}_{x,i,j} {k}_{t,i,j}$$

in which there are ni additional factors specifically for sex i. The parameter estimation procedure remains about the same with some modifications (see “Appendix”). Future death rates under this modified model are projected as:

$${\hat{m}}_{x,t,i}={\dot{m}}_{x,T,i}\exp\left({B}_{x}\left({\hat{K}}_{t}-{K}_{T}\right)+\sum \limits_{j=1}^{{n}_{i}}{b}_{x,i,j}\left({\hat{k}}_{t,i,j}-{k}_{T,i,j}\right)\right)$$

and the projected male-to-female ratio of death rates becomes:

$$\frac{{\hat{m}}_{x,t,2}}{{\hat{m}}_{x,t,1}}=\frac{{\dot{m}}_{x,T,2}}{{\dot{m}}_{x,T,1}}\exp\left(\sum \limits_{j=1}^{{n}_{2}}{b}_{x,2,j}\left({\hat{k}}_{t,2,j}-{k}_{T,2,j}\right)-\sum \limits_{j=1}^{{n}_{1}}{b}_{x,1,j}\left({\hat{k}}_{t,1,j}-{k}_{T,1,j}\right)\right)$$

Hyndman and Ullah (2007) proposed a generalised version of the Lee–Carter model using a functional data analysis approach. One of the extensions for modelling multiple groups therein considered a common age effect (CAE), where different populations share the same age-period sensitivity. Kleinow (2015) also proposed a CAE extension of the Lee–Carter model, and found that compared to a separate age effect for each country, the CAE resulted in a better model fit, based on the use of ten countries’ data. This common age effect concept can readily be adapted to the PCFM. Figure 1 shows the computed \({b}_{x,i,1}\) values, which are rescaled to be more comparable between both sexes. They represent the age sensitivity to the sex-specific short-term deviation from the main trend. It is interesting to see that the \({b}_{x,i,1}\) values from the original PCFM demonstrate peaks and troughs at fairly similar ages between females and males for England and Wales, Denmark, Canada, and the United States.Footnote 3 For instance, for the United States, the highest sensitivities can be found at around ages 35 and 70. This observation provides us an incentive to seek a more efficient use of model parameters. Accordingly, we propose another modification of the PCFM with a common age effect (PCFM-CAE), in which the age sensitivity measures in each additional factor are assumed to be equal between both sexes, that is, \({b}_{x,1,j}={b}_{x,2,j}\). This assumption allows a more parsimonious use of parameters and may lead to a lower BIC value in certain cases. As a result, the log central death rate is modelled as:

Fig. 1
figure 1

Computed \({b}_{x,i,1}\) values from the baseline PCFM for both sexes

$$\ln{m}_{x,t,i}={a}_{x,i}+{B}_{x} {K}_{t}+\sum \limits_{j=1}^{n}{b}_{x,j} {k}_{t,i,j}$$

Again, some changes are needed for the parameter estimation procedure (see “Appendix”). Future death rates under the PCFM-CAE are projected using the same equation as the baseline PCFM, with \({b}_{x,i,j}\) replaced by \({b}_{x,j}\):

$${\hat{m}}_{x,t,i}={\dot{m}}_{x,T,i}\exp\left({B}_{x}\left({\hat{K}}_{t}-{K}_{T}\right)+\sum \limits_{j=1}^{n}{b}_{x,j}\left({\hat{k}}_{t,i,j}-{k}_{T,i,j}\right)\right)$$

and the projected male-to-female ratio of death rates is equal to:

$$\frac{{\hat{m}}_{x,t,2}}{{\hat{m}}_{x,t,1}}=\frac{{\dot{m}}_{x,T,2}}{{\dot{m}}_{x,T,1}}\exp\left(\sum \limits_{j=1}^{n}{b}_{x,j}(({\hat{k}}_{t,2,j}-{k}_{T,2,j})-({\hat{k}}_{t,1,j}-{k}_{T,1,j}))\right)$$

Note that since \({b}_{x,j}\) of both sexes are equal, there are two alternative conditions to make this projected ratio converge over time. The first is the same as before—both \({\hat{k}}_{t,1,j}\) and \({\hat{k}}_{t,2,j}\) need to converge. The second (less stringent) condition is that only convergence in \({\hat{k}}_{t,2,j}-{\hat{k}}_{t,1,j}\) is required, but not \({\hat{k}}_{t,1,j}\) and \({\hat{k}}_{t,2,j}\) themselves. For instance, \({\hat{k}}_{t,2,j}-{\hat{k}}_{t,1,j}\) is modelled by a weakly stationary time series process, while the time series process for \({\hat{k}}_{t,1,j}\) is not necessarily stationary.

Fitting results

The mortality data of Australia, England and Wales, France, West Germany, Denmark, Sweden, Canada, and the United States from 1970 to 2011 are obtained from the Human Mortality Database (HMD 2019). The eight developed countries are selected on the basis that they are good representatives of some major continents including Australasia, Europe, and North America. As shown in the following analysis, the data of these countries call for different model choices amongst the alternatives considered, highlighting the importance to have more flexibility in the modelling approach. These datasets are sorted by sex, calendar year, and single year of age. As the exposure to risk and death counts at advanced ages are generally too volatile,Footnote 4 the age range 0–89 is chosen to perform a more precise analysis. Moreover, in line with previous studies (Li 2013; Yang et al., 2016), the year 1970 is chosen as the start of the sample period in order to avoid the structural changes in mortality improvementFootnote 5 that occurred around that time. This choice ensures that the data used are relevant and helps to make projections more straightforward than otherwise.

The optimal model fitFootnote 6 is decided by analysing the BIC values and standardised residual plots. Table 1 shows the BIC results of the PCFM-VSF and PCFM-CAE for the eight selected populations. Note that the case of n = 0 refers to having no additional factors for both sexes. The baseline PCFM results are indeed included in the diagonal of the table, as the PCFM-VSF is effectively an expansion of the original model. For Australia, the baseline model results agree with those from Li (2013) and Yang et al. (2016) that a use of two additional factors is the optimal choice, leading to the lowest BIC value (72,718) on the diagonal. The PCFM-CAE results suggest a further improvement in the BIC (71,838), where there is no change in the optimal number of additional factors. By contrast, the PCFM-VSF results indicate that the optimal choice is to use two male-specific factors but only one female-specific factor (72,674). For the Australian data, both new models produce a better fit than the baseline model, in terms of the BIC, with the PCFM-CAE as the most optimal choice amongst all.

Table 1 BIC values of fitted PCFM (diagonal), PCFM-VSF (matrix), and PCFM-CAE (right column), Australia, England & Wales, France, West Germany, Denmark, Sweden, Canada, United States

For England and Wales, there is no improvement shown by the PCFM-VSF over the baseline PCFM—both versions indicate n = 3 for both male- and female-specific factors (90,810). In comparison, the PCFM-CAE suggests one more additional factor for each sex (n = 4) which results in a better model fit under the common age effect (89,174).

For France, the baseline PCFM points to the use of four additional factors (94,176). Similar to the Australian results, the PCFM-VSF suggests that the optimal model choice is to remove one factor from the optimal baseline model, resulting in three female-specific factors and four male-specific factors (93,743). On the other hand, the PCFM-CAE leads to a different conclusion of putting more factors instead, in which the optimal choice is using five additional factors for each sex (92,980). For the French data, both new models deliver a better fit than the original model, again with the PCFM-CAE being the most optimal version.

For West Germany, the PCFM-VSF shows no improvement over the baseline PCFM here at n = 4 (101,580), but the PCFM-CAE suggests using two extra additional factors for each sex (n = 6), rather than just putting one more as in the previous cases, and leads to a lower BIC value (99,922).

For Denmark, the baseline PCFM suggests n = 1 (63,148), while the PCFM-VSF adds one more female-specific factor and generates a lower BIC value (63,109). On the other hand, the PCFM-CAE recommends using two additional factors for each sex (n = 2) and produces the lowest BIC value (62,848) amongst all the options.

For Sweden, the PCFM-VSF demonstrates no improvement over the baseline PCFM at n = 1 (64,927), while the PCFM-CAE gives a lower BIC value (64,301) via using n = 2. For Canada, the results are similar to those of England and Wales, in which the PCFM-VSF makes no improvement over the baseline PCFM at n = 2 (77,159), whereas the PCFM-CAE requires one more additional factor for each sex (n = 3) and produces a better model fit (76,381). Lastly, for the United States, there are some obvious differences in the model fitting results here compared to the other populations. While the baseline PCFM recommends n = 5 (114,904), the PCFM-VSF suggests placing an extra male-specific factor, leading to five female-specific and six male-specific factors (114,788). As before, the PCFM-CAE supports using more factors, and takes six additional factors for each sex as the optimal choice (115,012). But for the United States data, the PCFM-VSF shows the best model fit, whereas the PCFM-CAE is the least optimal among the three candidates.

Overall, in terms of the BIC, the PCFM-CAE leads to the best model fit for seven of the countries, the PCFM-VSF is the best choice in one case, and the baseline PCFM is the least optimal amongst the three options for seven countries. These results clearly show that the two proposed extensions outperform the original model in our fitting exercise. The underlying age sensitivity patterns may be quite similar between both sexes for certain countries and so setting a common age sensitivity measure may make the model more parsimonious. However, there may also be a resulting offsetting effect from the increase in the number of additional factors required. Moreover, a different number of additional factors for each sex can provide more flexibility for modelling mortality data. Regardless of whether it leads to the best model fit, it is interesting to see that there are more male-specific factors required than female-specific factors in the optimal PCFM-VSF for Australia, France, and the United States. It reflects that these male populations may have more complex mortality patterns or trends which demand a use of more model parameters.

Next, the standardised deviance residualsFootnote 7 from fitting the three optimal models to the eight datasets are examined. For instance, Fig. 2 displays the residuals for females in Australia, France, and the United States. Overall, the residuals look largely random when plotted against age or calendar year.Footnote 8 However, the residuals plotted against cohort year are relatively less random. Particularly, there are differences in the magnitude of these cohort patterns between the models. For Australia, although the PCFM-VSF improves the BIC value over the baseline PCFM, it comes at the cost of more obvious patterns in the residuals for females. This issue has also been found for France, though to a lesser extent. These observations are likely due to the optimal PCFM-VSF model having one less female-specific factor compared to the baseline PCFM. However, this difference is not evident for the United States, owing to the large number of factors being used in both models. For the PCFM-CAE, there are no noticeable differences in cohort residual randomness for Australia, France, and the United States when compared to the baseline PCFM. For England and Wales, as well as Canada (not shown here), the PCFM-CAE shows a slight improvement in the residuals over the baseline PCFM. In contrast, for West Germany (not shown here), the PCFM-CAE leads to slightly more pronounced patterns in the residuals. In fact, all these cohort patterns (mainly for some older individual cohorts) may be addressed by modifying the PCFM-VSF and PCFM-CAE to include a cohort factor in a similar approach to Yang et al. (2016).Footnote 9

Fig. 2
figure 2

Standardised deviance residuals for the PCFM, PCFM-VSF, and PCFM-CAE, females in Australia, France, and United States

To conduct another check on the suitability of the model fit, the mean absolute percentage error (MAPE) values of the fitted log central death rates are displayed in Table 2. The MAPE is defined as:

Table 2 MAPE values of fitted log central death rates (and original scale)
$$\frac{1}{{n}_{d}}\sum \limits_{x,t,i}\left|\frac{\ln{\tilde{m}}_{x,t,i}-\ln({d}_{x,t,i}/{E}_{x,t,i})}{\ln({d}_{x,t,i}/{E}_{x,t,i})}\right|$$

where \({n}_{d}\) is the number of data points, \({\tilde{m}}_{x,t,i}\) is the fitted log central death rate, and \({d}_{x,t,i}\) and \({E}_{x,t,i}\) are the observed number of deaths and exposure to risk respectively. The MAPE values are expressed in percentages below. In addition, the MAPE values on the fitted central death rates (original scale) are also provided in brackets. Note that no MAPE values of the PCFM-VSF are provided for those countries where the PCFM-VSF makes no improvement over the baseline PCFM.

All the MAPE values of the fitted log death rates are very small, largely being under 2%. The MAPE values of the fitted death rates are mostly around 5% or less, except for Denmark and Sweden which have more volatile data due to their smaller population sizes. The differences between the three models look rather immaterial, and none of the models has a clear advantage over the others. In terms of only the goodness-of-fit (MAPE), the performances of all the three models are satisfactory.

As mentioned earlier, there seems to be some trade-off between setting a common age sensitivity measure (fewer parameters) and placing more additional factors (more parameters). For seven of the countries being considered, the PCFM-CAE applications require incorporating one or two extra additional factors compared to the baseline PCFM. Consequently, there are more time-varying components in the structure and so more time series processes are needed to perform future projections and simulations. Despite a better model fit, using more time series assumptions may complicate the projection and simulation exercise, and caution must be taken on the stationarity and other properties of the time series processes adopted.

To further test the flexibility offered by the PCFM-VSF and PCFM-CAE, we consider a ‘hybrid’ between the two new models, in which the PCFM-CAE structure is extended to include one extra (either) male- or female-specific factor. Table 12 in the “Appendix” gives the BIC values of fitting this hybrid model to the eight datasets. The hybrid model turns out to be optimal for France, Denmark, and the United States. This model variation can help address the potential limitation of the PCFM-CAE in assuming all the same \({b}_{x,j}\) between females and males.

Projection results

This section examines the accuracy and coherence of the mortality forecasts produced by the optimal model under each of the three variations. As noted previously, the common mortality index \({K}_{t}\) is modelled as a random walk with drift, while the time-varying terms \({k}_{t,i,j}\) are modelled as AR(p) processes. The selected order p is based on the PACFs of the time-varying components, the autocorrelations of the residuals, and whether the fitted process is weakly stationary. When the fitted AR(p) process is not weakly stationary, a random walk without drift is used as a substitute in this section. The purpose is to ensure that the projected male-to-female ratio of death rates at each age eventually tends to a constant, thus avoiding the potential problems of unrealistic projected paths and mortality crossover between both sexes. An example requiring this arbitrary adjustment in our analysis is the second female- and male-specific factors of the PCFM-CAE for England and Wales, where the autoregressive parameters are greater than one, and if unadjusted, the corresponding projected death rate ratio does not converge, and the model forecasts are then not coherent.

Figure 3 illustrates the projected male-to-female ratios of death rates under the three models for four of the countries considered. In general, the three models show similar projection properties. There are some slight differences in the direction and speed of convergence in the projected ratios over time across the various age groups, but all the projected ratios converge ultimately (though very slowly in some cases) and appear roughly in line with the past trends. One potential issue arises with the PCFM-CAE projections for West Germany—because the optimal model has two more additional factors than the baseline PCFM, there are four extra time-varying components being projected. Using more time series processes due to the extra components results in the projected ratios having much stronger autoregressive effects, being most evident in the 0–9 and 10–29 age groups.Footnote 10 Although these projected ratios still converge gradually over time and the projected levels reflect the past data reasonably well, the fluctuations may nevertheless be undesirable for forecasting purposes. This problem could be mitigated by using AR(1) models as opposed to AR(p) models for the time-varying components, but this alternative method is rather arbitrary.

Fig. 3
figure 3

Projected (and observed) male-to-female ratios of death rates

The accuracy of the mortality forecasts is assessed via out-of-sample testing, also known as backtesting, in which the PCFM, PCFM-VSF, and PCFM-CAE mortality projections are compared against the actual observations. The data period is split into two parts—the first 30 years’ data are used to fit each model and the remaining 12 years are used to evaluate the forecasting performance of that model. The projection accuracy is measured by the MAPE of the projected log central death rates for each sex, defined in a similar manner to the previous section.

Table 3 presents the MAPE values of the projected estimates of the eight countries under the three models. The MAPE values of the projected male and female log death rates are mostly less than 4%, and the differences between the models are very small. It is promising to see that with fewer parameters (roughly about 100 fewer on average), the new models can still produce a similar forecasting performance. When considering the projected male-to-female ratios, the PCFM-CAE noticeably outperforms the other two models for most of the countries considered. The more obvious cases are France (5.38%), West Germany (5.01%), and the United States (3.84%). For France, age groups 0–9 and 30–49 are the main contributors to the difference, while for West Germany, age group 10–29 is where the key difference lies.Footnote 11 These results suggest that the PCFM-CAE may be a more appropriate model in terms of forecasting the male-to-female death rate ratios.

Table 3 MAPE values of projected log central death rates and projected male-to-female ratios

When the PCFM-VSF is compared with the baseline PCFM, the differences are not as clear. Regarding the projected male-to-female ratios, the PCFM-VSF outperforms the baseline PCFM for West Germany, slightly underperforms instead for France, and the difference between the two models for the United States is negligible. For France, the PCFM-VSF is slightly worse at the very end of the projection in age groups 0–9 and 10–29. For West Germany, the PCFM-VSF is better at ages 10–29. With fewer (only three) populations to compare, it is difficult to come to any conclusion as to whether the PCFM-VSF would outperform the baseline PCFM.

It is important to understand that the results of out-of-sample testing may change considerably if a different split between the sample and testing periods is used. To further investigate the forecast accuracy of the three models, the analysis is repeated with a sample period of 25 years and a testing period of 17 years. The results of this second out-of-sample test are also presented in Table 3. Regarding the MAPE values of the projected log death rates, the second test tends to produce larger errors than the first test, but there is again not much difference between the models. For the MAPE values of the projected male-to-female ratios, the longer testing period in the second test clearly results in lower accuracy, particularly for Canada and the United States (around 10% to 13%). The projected male-to-female ratios depart more from the actual values at age groups 10–29 and 30–49 for France, Canada, and the United States in the second test. For these three countries, there is an unexpected sudden change in direction in the trends of the observed ratios at ages 10–49. In this test, the PCFM-VSF and PCFM-CAE largely outperform the baseline PCFM in projecting the male-to-female ratios of death rates.

The out-of-sample testing is further conducted on a rolling-window basis, in which successive splits between the sample and testing periods are used (the last 5 years as the testing period and all the previous years as the sample period, then the last 6 years as the testing period and the previous years as the sample period, and so on until the last 20 years as the testing period). Table 4 gives the MAPE values averaged over all the splits. While the differences in the projected log death rates are minimal between the three models, the PCFM-CAE still tends to outperform the baseline PCFM regarding the projected death rate ratios.

Table 4 Average MAPE values of projected log central death rates and projected male-to-female ratios (rolling-window basis)

We also project life expectancies by extrapolating death rates to age 109, assuming that \(m_x=0.7\) eventually (Gampe, 2010) and using the Coale and Guo (1989) method to close out the life table. Table 5 shows the MAPE values of the projected life expectancies at ages 0 and 60 using the rolling-window basis. Again, the differences in accuracy between the three models are negligible. Though the new models use much fewer parameters than the baseline model in general, they can still make life expectancy forecasts of highly comparable quality. It is clear to see the efficiency of the new modifications in forecasting future mortality.

Table 5 Average MAPE values of projected life expectancy values (rolling-window basis)

As discussed above, the models sometimes produce death rate ratio forecasts that are quite different to what has actually happened, e.g. ages 10–29 for Australia and West Germany, and ages 30–69 for France. While it would be difficult for the models to accurately predict a sudden reversal in the trends, it is still useful if the models can produce prediction intervals which capture the possibility of these abrupt changes. Accordingly, we use the semi-parametric bootstrapping process (Brouhns et al. 2005) to generate the 95% prediction intervals for each age group under the three models, and calculate the actual proportions that the observed ratios move outside the intervals. Table 6 lists the percentage of the observed ratios not being captured by the prediction intervals under each model, using a sample period of 35 years and a testing period of 7 years. The PCFM-CAE prediction intervals work fairly well for Australia, England and Wales, Denmark, Canada, and the United States, in which the percentages are 0% to 11% (quite close to the theoretical value of 5%). The PCFM-CAE outperforms the baseline PCFM in this analysis. However, if a longer testing period is used, the proportions of getting outside the prediction intervals may increase. Table 7 gives the corresponding percentage of the observed life expectancy values (at birth) falling outside the prediction intervals under each model. Figure 4 plots the projected and observed values in the testing period with the prediction intervals for females in West Germany and the United States. As shown, the observed values move well within the prediction intervals in many of the cases. For France and West Germany, though the observed values go outside the prediction intervals for a number of times, the extent is mild and the observed values still move close to the bounds. There is also a tendency for the PCFM-VSF and PCFM-CAE to generate slightly wider prediction intervals than the baseline PCFM. It may be caused by the fact that the PCFM-VSF allows a different number of additional factors between both sexes, and that the PCFM-CAE fitting above often requires using more additional factors.

Table 6 Proportions of observed male-to-female ratios of death rates falling outside 95% prediction intervals
Table 7 Proportions of observed life expectancy values at birth falling outside 95% prediction intervals
Fig. 4
figure 4

Observed (solid lines) and projected (dashed lines) female life expectancy values at birth with 95% prediction intervals (dotted lines)

Note that although the semi-parametric bootstrapping already takes parameter error into account, the resulting variability in the simulations may still not capture all the potential abrupt changes, as the patterns in the past data are rather steady. Other methods such as Bayesian Markov chain Monte Carlo simulation (Czado et al. 2005) may also be sought.

In summary, the results in these two sections clearly suggest that the PCFM-CAE and PCFM-VSF have potential advantages over the baseline PCFM. Qualitatively, the new models provide more flexibility; quantitatively, they are shown to be able to deliver a potentially better performance in model fitting and future projections and simulations. In particular, the new modifications can produce a comparable forecasting performance on the death rates using fewer parameters, and at the same time generate better forecasts for the death rate ratios. As demonstrated in Yang et al. (2016), the way to project the death rate ratios can have a significant impact on the pricing of pensions and annuities between the two sexes for the sponsoring entities and insurers. It is very important to have a proper modelling of the death rate ratios.

Comparison with other models

This section further compares the forecasting performances between the proposed models, the product-ratio method proposed by Hyndman et al. (2013), and the sex-ratio approach suggested by Bergeron-Boucher et al. (2018). Suppose \({p}_{x,t}=\sqrt{{m}_{x,t,1} {m}_{x,t,2}}\) and \({r}_{x,t}=\sqrt{{m}_{x,t,1}/{m}_{x,t,2}}\) are the square roots of the products and ratios of central death rates. We apply the product-ratio method as:

$$\begin{aligned} & \ln p_{x,t} = \eta_{x} + \phi_{x,1} \lambda_{t,1} + \phi_{x,2} \lambda_{t,2} + \cdots + \phi_{x,g} \lambda_{t,g} \\ & \ln r_{x,t} = \xi_{x} + \psi_{x,1} \gamma_{t,1} + \psi_{x,2} \gamma_{t,2} + \cdots + \psi_{x,h} \gamma_{t,h} \\ \end{aligned}$$

The parameters \({\eta }_{x}\) and \({\xi }_{x}\) are the average of \(\ln{p}_{x,t}\) and \(\ln{r}_{x,t}\) over time, and \({\phi }_{x,j}\), \({\lambda }_{t,j}\), \({\psi }_{x,j}\), and \({\gamma }_{t,j}\) are the principal components of \(\ln{p}_{x,t}\) and \(\ln{r}_{x,t}\). As females and males have about the same data size, their log death rates have approximately equal variances, and so \(\ln{p}_{x,t}\) and \(\ln{r}_{x,t}\) are largely uncorrelated. The principal component analysis can then be applied separately to each term to compute the model parameters. Following Hyndman et al. (2013), we set \(g=h=6\). We assume that \({\lambda }_{t,1}\) follows a random walk with drift while all the other time-varying parameters follow an AR(p) process. Particularly, the time series processes for \(\ln{r}_{x,t}\) have to be weakly stationary such that the projected death rate ratio at each age converges to a constant over time. On the other hand, the sex-ratio approach proposed by Bergeron-Boucher et al. (2018) assumes a Lee–Carter kind of structure for the log death rate ratio:

$$\ln({m}_{x,t,2}/{m}_{x,t,1})={\alpha }_{x}+I(x<45){\beta }_{x}^{(1)} {\kappa }_{t}^{(1)}+I(x\ge 45){\beta }_{x}^{(2)} {\kappa }_{t}^{(2)}$$

The term \({\alpha }_{x}\) is the age effect and \({\beta }_{x}^{(1)}\) is the age sensitivity to the time-varying parameter \({\kappa }_{t}^{(1)}\) for ages below 45. For ages 45 and over, there is another set of parameters \({\beta }_{x}^{(2)}\) and \({\kappa }_{t}^{(2)}\) to separate the male excess cancer mortality from the male excess accident mortality. Here we assume that the female death rates follow the Lee–Carter model, and that the time-varying parameters \({\kappa }_{t}^{(1)}\) and \({\kappa }_{t}^{(2)}\) follow a weakly stationary AR(p) process. The female death rates are projected by the equation \({\hat{m}}_{x,t,1}={\dot{m}}_{x,T,1}\exp({b}_{x,1}({\hat{k}}_{t,1}-{k}_{T,1}))\), and then the male death rates are projected as \({\hat{m}}_{x,t,2}={\hat{m}}_{x,t,1}\frac{{\dot{m}}_{x,T,2}}{{\dot{m}}_{x,T,1}}\exp(I(x<45){\beta }_{x}^{(1)}({\hat{\kappa }}_{t}^{(1)}-{\kappa }_{T}^{(1)})+I(x\ge 45){\beta }_{x}^{(2)}({\hat{\kappa }}_{t}^{(2)}-{\kappa }_{T}^{(2)}))\).

Table 8 compares the MAPE values computed from the product-ratio method, the sex-ratio approach, and the PCFM-CAE for each country using the rolling-window basis. For the MAPE values of the projected death rates, the performances of the three approaches are quite close. The product-ratio method tends to perform better for female death rates, while the PCFM-CAE and the sex-ratio approach tend to perform better for male death rates. But for the MAPE values of the projected male-to-female ratios, the PCFM-CAE and the sex-ratio approach generally outperform the product-ratio method. The PCFM-CAE performs slightly better than the sex-ratio approach for West Germany, Denmark, Sweden, and the United States, while it is reversed for the other four countries. Moreover, Table 9 shows the corresponding MAPE values of the projected life expectancies at ages 0 and 60. The sex-ratio approach tends to produce lower MAPE values for male life expectancy, while the PCFM-CAE and the product-ratio method tend to produce lower MAPE values for female life expectancy.

Table 8 Average MAPE values of projected log central death rates and projected male-to-female ratios (rolling-window basis)
Table 9 Average MAPE values of projected life expectancy values (rolling-window basis)

Furthermore, Tables 10 and 11 provides the percentages of the observed death rate ratios and the observed life expectancy values falling outside their prediction intervals under each approach (generated from semi-parametric bootstrapping). Figure 5 plots the observed and projected life expectancy values with the prediction intervals for the product-ratio method and the sex-ratio approach. It can be seen that the prediction intervals generated by the sex-ratio approach appear to be too narrow, and so they cannot capture the observed values at a sufficient level compared to the PCFM-CAE and the product-ratio method. The underlying reason may be the use of only three time-varying parameters under the sex-ratio approach.

Table 10 Proportions of observed male-to-female ratios of death rates falling outside 95% prediction intervals
Table 11 Proportions of observed life expectancy values at birth falling outside 95% prediction intervals
Fig. 5
figure 5

Observed (solid lines) and projected (dashed lines) female life expectancy values at birth with 95% prediction intervals (dotted lines) under product-ratio and sex-ratio approaches

Overall, it appears that the PCFM-CAE can provide both a relatively accurate way to forecast the death rates and their ratios and also a more proper allowance for the potential variability, when compared to the other two approaches. Note that the sex-ratio approach adopted here makes use of the Lee–Carter model for the female death rates. Its performance may be significantly different if another mortality model is used.

Concluding remarks

In this paper, we have explored a number of modifications to the PCFM proposed by Li (2013) (which is an extension of Li and Lee, 2005) and evaluated the data fitting and forecasting performance using mortality data from eight populations. The new modifications involve allowing for a different number of additional factors for each sex and incorporating a common age effect in the additional factors between both sexes. Our analysis shows that on the whole the modified PCFM-CAE and PCFM-VSF produce better fitting and projection results than the baseline PCFM. Firstly, the BIC results clearly indicate that the two proposed models lead to a better fitting performance for the data of several countries. The benefit of using fewer parameters more than offsets the corresponding impact on the likelihood value. Secondly, despite the use of much fewer parameters, the forecasting performance is maintained overall, and can even be improved when dealing with the death rate ratios. It evidently represents a much more efficient use of model parameters. Moreover, the two new models tend to generate wider prediction intervals than those produced from the baseline PCFM. The wider intervals would then be able to capture the future actual values more satisfactorily. In addition, the PCFM-CAE produces better death rate ratio projections than the product-ratio method, and it also yields much wider life expectancy prediction intervals than the sex-ratio approach.

It is clear that these new modifications can offer more flexibility to deal with the varying features and patterns in different populations. The hybrid model in the “Appendix”, combining the PCFM-CAE and PCFM-VSF, can further improve the fitting performance for some countries. Note that the projected death rate ratios can have a significant effect on the pricing of pensions and annuities between both sexes. Yang et al. (2016) showed that an incorrect projection of the ratios could lead to an inappropriate cross-subsidisation between female and male annuitants. From the insurer’s or sponsor’s perspective, it is of critical importance to model the relationship between female and male mortality properly.

There are some possible areas for future research. It would be interesting to apply these models to more populations’ datasets and compare the results, to further explore the level of flexibility they would offer. Moreover, it would be useful to implement these models in practical scenarios such as government policy planning and insurance pricing, since a realistic joint projection of female and male mortality is very important in valuing annuities, pensions, and social benefits. Furthermore, a more detailed investigation on the use of time series models would be warranted. The choice of time series modelling affects not only the central projection but also the level of variability in the simulation. For demographic or insurance studies requiring probabilistic calculations, one should be cautious about the specific effects of assuming a particular time series process. As a quick comparison, we have tried using an AR(1) process arbitrarily for all the computed time series. Figure 7 in the “Appendix” shows that the projected male-to-female ratios of death rates then converge faster than what have been shown previously. It means that the pace of reaching coherence would be negatively related to the order of the AR(p) process. Despite the various statistical tests to determine the order of the time series process, it remains partially a matter of judgement on how fast the projected ratios should converge over time, if one decides to make coherent mortality forecasts.