1 Introduction

Many popular option-implied metrics such as risk-neutral variance, skewness and the variance risk premium are calculated based on an estimate of the option-implied volatility surface. We document in this paper that the method for constructing the volatility surface affects these standard option-implied quantities. Our findings hold more generally for any quantity that is extracted from the aggregation of option prices along the strike range.

State-of-the-art methodologies such as the semi-parametric spline interpolationFootnote 1 of Figlewski (2008) or the three-dimensional kernel regression of OptionMetrics (2016) produce surprisingly large differences in standard option-implied quantities. In our sample for S&P 500 options (2004–2017), Bakshi et al. (2003) risk-neutral variance for medium-term maturities, computed with exactly the same procedure, but based on the volatility surface from the interpolation scheme of Figlewski (2008) or OptionMetrics (2016), differs in relative terms by more than 10% on average. The 1-month-ahead variance risk premium varies across both volatility surfaces by a relative margin of on average 60%. Differences are even more troublesome for risk-neutral skewness, where we document relative differences in the order of 200% and more.

The key question to ask is which volatility surface represents market information most accurately? As information is extracted from option prices by means of deterministic manipulations of the observed portions of the volatility surface, it is natural that it is the most accurate volatility surface that also reprices options most accurately. We therefore perform a detailed empirical investigation to understand which volatility surface captures market information most accurately. Our test incorporates the semi-parametric spline interpolation (Figlewski 2008), a three-dimensional non-parametric kernel regression (OptionMetrics 2016), and the parametric Gram–Charlier expansion (Beber and Brandt 2006). In addition, we also propose an one-dimensional non-parametric kernel regression. We compare the statistical accuracy of these four volatility surfaces by means of leave-one-out cross-validation root mean squared errors (RMSEs) and mean absolute errors (MAE). Calculating the average integrated squared second derivative of the respective implied volatility smiles allows us to identify differences in smoothness. Our tests expand across two dimensions: (i) options on the S&P 500 and on the Euro Stoxx 50 and (ii) with an end-of-day and intraday frequency. The time span of the analysis is 2004–2017 for US data and 2002–2017 for European data.

Our main findings are as follows: First, the one-dimensional kernel regression generates the most accurate volatility surface for S&P 500 options by means of the lowest leave-one-out cross-validation RMSE and MAE. This holds for intraday and for end-of-day data.

Second, for the end-of-day analysis of S&P 500 options, the spline-based volatility surface turns out to be the second best, with a RMSE (MAE) that is on average 128% (91%) higher than the RMSE of the one-dimensional kernel regression surface. The three-dimensional kernel regression produces a RMSE that is more than 5 times larger than the RMSE of the one-dimensional kernel regression and an 11 times higher MAE. The Gram–Charlier volatility surface produces the largest RMSEs (MAEs), on average over 9 (23) times larger than the RMSE (MAE) of the best performing volatility surface. The results for Euro Stoxx 50 options confirm our general findings, though the one-dimensional kernel regression and the spline interpolation appear to perform roughly at par here.

Third, state-of-the-art volatility surfaces turn out to be less accurate than the one-dimensional kernel regression volatility surface because these surfaces do not accurately capture market information in the thinly traded tails of the volatility surface. The volatility surface based on spline interpolation shows weakness in capturing the left tail of short-term options. The three-dimensional kernel regression shows severe shortcomings in capturing the left tail of short-, medium- and long-term options. The method of the Gram–Charlier expansion partially captures the at-the-money region of the volatility smiles well, but shows otherwise shortcomings in capturing the left and right tail of options across all maturities.

Fourth, for the intraday analysis we find that the most accurate volatility surface is again the one that is generated by the one-dimensional kernel regression, if data is very scarce. However, the indication about the most accurate method depends on the number of trades that are used to construct the implied volatility surface and diverges between the error measures. While the one-dimensional kernel regression continues to lead in most set-ups with respect to the RMSE, the spline interpolation quickly shows lower MAEs than all other methods as the number of trades grows. This finding can be rationalized by occasional over-fitting in the spline method, which occurs less often if more trades are available.

Our analysis concludes that option-implied information can differ substantially across volatility surfaces. The three-dimensional kernel regression of OptionMetrics (2016) underpredicts option-implied tail risk at the end-of-day data frequency, which translates into systematic biases in risk-neutral skewness and variance. The Gram–Charlier expansion is not flexible enough to closely track observed option prices. The one-dimensional kernel regression appears to produce the most accurate volatility surface for the end-of-day use-case and if data is very scarce. Yet, using the spline interpolation might be beneficial if an intermediate amount of implied volatility observations is available. In that case, the over-fitting tendency in the spline interpolation is already dampened enough to produce the lowest MAEs, although the performance spread to the one-dimensional kernel regression is not large.

Our research study adds to the growing empirical finance literature that exploits option-implied information. By now, this literature is too vast to be reviewed here in detail. Hence, we cannot give credit to all studies and have to leave out important contributions. However, we discuss a selection of recent studies and focus on how these have constructed the option-implied volatility surface.

A large and diverse amount of research studies work with the volatility surface of OptionMetrics (2016): For example, Buss and Vilkov (2012) estimate option-implied correlations and CAPM betas and find that higher option-implied betas go along with higher average returns. Chang et al. (2012) predict a stock’s beta based on Bakshi et al. (2003) implied volatility and skewness estimates. Martin and Wagner (2019) extract a measure for predicting the risk premium for each stock from its associated implied volatility surface. Christoffersen et al. (2017) propose a parametric option pricing model where illiquidity is a driver of the jump intensity of the underlying and can thus explain time variation in the implied volatility surface. Du and Kapadia (2012) compare the information content of the VIX and the Bakshi et al. (2003) implied variance measure and construct a tail risk index from its difference. Hofmann and Uhrig-Homburg (2018) use the wedge between implied volatility observations and respective OptionMetrics (2016) estimates to construct a limits of arbitrage measure.

Driessen et al. (2009) compare portfolios of single stock options with options on the index and find evidence for a substantial option-implied correlation risk premium. Bollerslev and Todorov (2011) introduce various risk measures for realized and expected continuous risk and jump risk under the empirical and risk-neutral probability measure and find that a significant portion of the equity risk premium is compensation for jump risk. The option-implied volatility surface in these innovative studies is constructed based on end-of-day closing prices and based on a version of the spline interpolation methodology that we use in this paper.

Martin (2017) shows that options contain information about the lower bound of the underlying’s expected return. Schneider and Trojani (2015) construct tradable option-implied strategies for higher moments. These studies do not interpolate the option-implied volatility surface, but work with observed option-implied volatilities.

Wright (2016) adopts the multi-dimensional kernel regression of Aït-Sahalia and Lo (1998) to construct a monthly option-implied volatility surface of real interest rates. The author pools all end-of-day volatilities (roughly 25 per day and maturity) within 1 month to stabilize the procedure and obtain a sufficiently smooth volatility surface. Swanson (2016) highlights that pooling prevents that study from working at a higher frequency. Moreover, Swanson (2016) suggests to use the spline methodology to construct daily option-implied volatility surfaces. As evidence in Bliss and Panigirtzoglou (2002) suggests, that method works well for 10 or more option prices per day and maturity. In order to stay in a non-parametric framework while still obtaining smoothness, Jackwerth and Rubinstein (1996) and Jackwerth (2000) propose a method that directly fits the volatility surface to observed data by minimizing the squared fitting error while at the same time maximizing the smoothness of the surface. Our research study contributes to this discussion, as we formally compare the statistical accuracy and smoothness of several state-of-the-art volatility surfaces across different frequencies and different currency zones. For our tests, we select two kernel regression approaches as representatives for the non-parametric class of volatility surfaces. These methods do not explicitly optimize for smoothness, thus enabling us to perform a fair evaluation between methods, since neither the kernel regression of OptionMetrics (2016) nor the spline method of Figlewski (2008) or the Gram–Charlier expansion features an explicit smoothness treatment. As we document, especially the one-dimensional kernel regression is still well capable of producing smooth volatility surfaces.

A large body of the literature studies parametric representations of the implied volatility surface. An early example is Schönbucher (1999), who assumes a diffusion process for the evolution of implied volatility.Footnote 2 He derives restrictions on the parameters to ensure arbitrage-freeness and to prevent the potential model-implied emergence of bubbles in implied volatility as the maturity decreases. The latter leads to a constraint that can essentially serve as a model for the volatility smile at a given maturity. Based on a two-dimensional diffusion for the forward price of an asset and its volatility, Hagan et al. (2002) derive a closed-form solution for a parametric volatility smile in their SABR model. The model is completely specified by four parameters, which essentially describe the level, skew and curvedness of the volatility smile. However, as Gatheral (2006) points out, the lack of a mean reversion component in the volatility diffusion makes it only applicable to short-maturity options. In his SVI model, Gatheral (2004) assumes a parametric function for the volatility smile that is similar to the representation of Schönbucher (1999), but features an additional parameter to locate the volatility smile across the strike range. By construction, the SVI model assumes that the implied volatility smile becomes approximately linear in the tails. This assumption is not always fulfilled in the data, which has triggered the development of generalized versions of the SVI model that allow concavity, for example Zhao and Hodges (2013) or Damghani and Kos (2013). Damghani (2015) further elaborates on this work in the context of the FX option market. His IVP model features an explicit treatment of the bid-ask spreads in option prices and allows to incorporate liquiditiy factors. By adding a maturity interpolation scheme, he is able to capture the whole volatility surface with a dramatic reduction in parameters that need to be estimated. Figlewski (2008) follows a different approach: He argues to use a 4th-order smoothing spline with one knot point at-the-money to model the volatility smile at each maturity separately. The higher amount of free parameters is accepted to reach a higher accuracy while still smoothing out noise. In our study, we use the method of Figlewski (2008) as representative of the class of parametric models of the implied volatility surface.

Survey and methodology papers that are closest to ours are Jackwerth (1999), Jondeau and Rockinger (2000), Bliss and Panigirtzoglou (2002) and Bahaludin and Abdullah (2017). Relative to these studies, our empirical assessment covers a much longer time span (14 years), US and EU equity option markets and distinguishes between characteristics of the input option data sets.

2 Data

We use data for two of the most actively traded equity option contracts: options on the S&P 500 and options on the Euro Stoxx 50. The former is traded at the CBOE while the latter is traded at the Eurex. The S&P 500 and the Euro Stoxx 50 do both stand for the Blue Chip stocks of their respective currency zone. Both option contracts are of European exercise style.

Historical data for S&P 500 options come from the CBOE Livevol data shop. We collect both end-of-day bid-ask prices as well as intraday transaction prices. For the end-of-day data set, we use the mid price for our calculations, whereas we use actual trade prices for the intraday data set. The end-of-day data is from January 2004 to July 2017. The intraday data spans the period January 2004 to October 2017. The data comes with the matched bid and ask prices of the underlying at the point of time of the record.

We obtain end-of-day settlement prices and intraday transaction prices for the Euro Stoxx 50 options and the underlying Euro Stoxx 50 index from the Karlsruher Kapitalmarktdatenbank (KKMDB), which is hosted at the Finance Institute of the Karlsruhe Institute of Technology. The KKMDB receives its data directly from the Eurex and Stoxx. The index data comes in 15 second intervals, while the intraday option data is time stamped at the point of time of the trade. We match the option trade with the index price directly at or prior to the trade’s time stamp. The end-of-day data spans January 2002 to September 2017, whereas the intraday data covers the period January 2003 to September 2017. Table 1 summarizes our option data sets in more detail.

Table 1 Data summary

For the risk-free rate, we use the US-Dollar and Euro OIS curves from Bloomberg, which come in discrete maturities between 1 day and 20 years and which we obtain at the daily frequency. We match each option record with the risk-free rate for its respective date. We interpolate the OIS rate linearly along the maturity dimension to match the maturity of the respective option.

We apply a number of filtering steps to ensure that only valid option prices enter the standardization. We only consider options with a price that fulfills basic no-arbitrage considerations. More precisely, we keep a Call option price C, if \(C \ge S e^{-q \tau } - K e^{-r \tau }\), and a Put option price P, if \(P \ge K e^{-r \tau } - S e^{-q \tau }\), with the current price of the underlying S, a strike of K, maturity of \(\tau \) years, the riskfree rate r and the continuous dividend yield q. In addition, as most methods work in implied volatility space, we remove options, for which the calculation of the Black-Scholes implied volatility did not converge. Inline with Bliss and Panigirtzoglou (2002), we only keep out-of-the-money options and recover prices for in-the-money options (where needed) through put-call parity. The latter is not done for the three-dimensional kernel regression of OptionMetrics (2016), as that method explicitly works with both Call and Put prices.

Both, the S&P 500 and the Euro Stoxx 50 index, are price indices and are thus quoted ex-dividends. Filtering the raw option prices and the calculation of implied volatilities for maturity \(\tau \) requires an estimate of the continuous \(\tau \)-period ahead dividend yield. For each day, we determine the maturity-specific dividend yield as the median (across strikes) of the put-call parity implied dividend yields, using only options with the desired maturity. This estimate is used for all quote filters and implied volatility calculations of a respective day and maturity \(\tau \). More details on the dividend yield estimation and on matching Call and Put prices for put-call parity can be found in “Appendix A”.

3 Constructing a volatility surface

We start this section with providing a quick overview on the most common parametric, semi-parametric and non-parametric approaches for constructing an option-implied volatility surface.

One popular representative for a parametric approach is the Gram–Charlier expansion. It approximates an unknown density by starting with a Gaussian density and, similar to a Taylor expansion, iteratively adding higher-order terms to reduce deviations between the true unknown density and its previous approximation. This parametric approach captures volatility, skewness and kurtosis of the unknown risk-neutral density and allows for closed-form prices of Calls and Puts (Backus et al. 2004), and hence a semi-closed form expression for the option-implied volatility surface. A detailed explanation of this method can be found in “Appendix B.1”.

Interpolating observed option-implied volatilities with polynomial splines is certainly the most popular semi-parametric approach for constructing the option-implied volatility surface. Several variations of the spline interpolation approach exist, mainly differing in the degree of the spline or in the dimension over which the interpolation runs (see, for example, Bliss and Panigirtzoglou 2002; Jiang and Tian 2005; Chang et al. 2012). We follow Figlewski (2008) and Fengler (2009) as their procedure nests many other approaches of the literature. A detailed description of that method is in “Appendix B.2”.

An often applied non-parametric approach is the kernel regression. A prominent volatility surface, that is often used in financial economic research, is the kernel regression specification of OptionMetrics (2016). That data provider approximates missing values of the option-implied volatility surface (across strikes and maturity) with a three-dimensional kernel regression. The corresponding three bandwidth parameters are fixed in OptionMetrics (2016). To keep our paper self-contained, we provide in “Appendix B.3” a detailed description of that non-parametric approach.

We now continue to present our own kernel regression specification, that combines advantages of the above mentioned methods. First, it is a non-parametric approach, which ensures flexibility. Second, it is essentially only one-dimensional, which makes it more robust and less data intensive than existing multi-dimensional kernel regression approaches. Third, our approach ensures the volatility surface to be arbitrage-free, borrowing a technique from Fengler (2009). Fourth, conditional on having sufficient data, our approach takes special care of capturing market information that is hidden in the thinly traded tails.

3.1 One-dimensional kernel regression with tail extrapolation

We continue here to present in detail our suggested variation of the kernel regression method that was popularized by Aït-Sahalia and Lo (1998). Aït-Sahalia and Lo (1998) propose to approximate unobserved option-implied volatilities by the following kernel regression set-up:

$$\begin{aligned} {\hat{\sigma }}(F_j, K_j, \tau _j) = \sum _{i=1}^N \frac{k(F_j - F_i) \, k(K_j - K_i) \, k(\tau _j - \tau _i) }{\sum _{i=1}^N k(F_j - F_i) \, k(K_j - K_i) \, k(\tau _j - \tau _i)} \,\times \sigma (F_i, K_i, \tau _i), \end{aligned}$$
(1)

where N is the number of observed option prices that enter the kernel regression as inputs, \(\sigma (.)\) is the observed Black-Scholes implied volatility and \({\hat{\sigma }}\) is the interpolated Black-Scholes implied volatility for a desired tuple of strike \(K_j\), forward price of the underlying \(F_j\) and maturity \(\tau _j\). The kernel function k(x) has a Gaussian shape with its own bandwidth parameter \(h_x\) for each dimension, i.e.,

$$\begin{aligned} k(x) = \frac{1}{\sqrt{2 \pi }} e^{-\left( \frac{x^2}{2 h_x} \right) }. \end{aligned}$$
(2)

We are now going to present refinements that distinguish our approach from the tested application in Aït-Sahalia and Lo (1998). First, we reduce the input dimension by one unit, as we combine the underlying forward price with the strike level. More precisely, we define the observed moneyness measure as \(m_i = \frac{K_i}{F_i}\) and apply the kernel regression to \(m_j - m_i\) instead of \(F_j - F_i\) and \(K_j - K_i\).Footnote 3 In most cases, this step can be regarded as a mere technical rescaling of the strike axis, which does not affect the interpolation accuracy of the technique. In our end-of-day set-up, we only use the price observations of a single day to construct the respective day’s implied volatility surface, which all have the same underlying price as they are observed at the same point of time. Similarly, in the intraday set-ups, we construct the implied volatility surfaces based on subsequent option trade observations. In the vast majority of cases, the change in the underlying price between two option trades is very small, such that the underlying forward price can be considered approximately constant for the employed set of option trade observations.

The second refinement of the Aït-Sahalia and Lo (1998) approach is that we remove the time to maturity dimension, \(\tau \), from the kernel regression. This means that the kernel regression is purely used to approximate the option-implied volatility surface in the moneyness dimension.Footnote 4 Mathematically, this means we use the following kernel weighting for the moneyness dimension:

$$\begin{aligned} {\hat{\sigma }}(m_j, \tau _i)&= \sum _{i=1}^{N^*} \frac{k(m_j - m_i) \, \sigma (m_i, \tau _i)}{\sum _{i=1}^{N^*} k(m_j - m_i)}, \qquad m_i = \frac{K_i}{F_i}, \end{aligned}$$
(4)

where \(N^*\) denotes the size of the set of observed option prices with maturity \(\tau _i\). We set the bandwidth parameter for the moneyness dimension by the following rule-of-thumb. We first compute the average distance between two neighboring strikes

$$\begin{aligned} \overline{\varDelta K}(\tau _i) = \frac{1}{N^* - 1} (K^{max}(\tau _i) - K^{min}(\tau _i)), \end{aligned}$$

where \(K^{max}(\tau _i)\) (\(K^{min}(\tau _i)\)) represents the maximum (minimum) observed strike at maturity \(\tau _i\). We set the moneyness bandwidth to \(h_m(\tau _i) = 0.75 \overline{\varDelta K}(\tau _i)\). While the coefficient of 0.75 may appear ad-hoc, our tests reveal that the results barely change if it is set at any value within a range of [0.6, 1].

Fig. 1
figure 1

Effect of tail extrapolation on kernel regression. This figure visualizes for a particular day of the sample, May 22nd, 2012, the effect of the linear tail extrapolation on the option-implied volatility smile for Euro Stoxx 50 options with 24 days to maturity. The left panel shows the kernel regression without tail extrapolation, the right panel includes the tail extrapolation. The dots mark observed (and artificial) implied volatility observations, the line depicts the interpolation

Third, far out-of-the-money options are usually not observed, but very important ingredients for capturing the tails of the option-implied volatility surface. For the analysis at the end-of-day frequency, we use a linear extrapolation scheme to increase the number of ’observed’ option-implied volatilities in both tails, before applying the kernel smoother. The extrapolation works as follows: We start with a set of Black-Scholes implied volatilities \(\{\sigma _i\}_{i \in [1, N]}\) for some point of time. We split this set into subsets \(\{\sigma _i^{\tau }\}_{i \in [1, N]}\) with all elements in a subset having the same time to maturity \(\tau \). We assume that the elements of \(\{\sigma _i^{\tau }\}_{i \in [1, N]}\) are sorted ascendingly by their moneyness \(m_i\).

For the left tail extrapolation, we use the first 5 implied volatilities of each subset. The number of 5 observations is a trade-off between stability on the one hand and including only the most extreme tail observations on the other hand. Unreported results show that our findings are robust to variations in that considered number of observations. On these 5 observations, we estimate the following linear regression using OLS:

$$\begin{aligned} \sigma _i^{\tau } = \alpha + \beta \, m_i + \epsilon _i, \quad \epsilon _i \sim i.i.d. \end{aligned}$$

Starting from the lowest observed moneyness, we proceed in steps of \(\overline{\varDelta K}(\tau _i)\) and calculate the extrapolated implied volatility as \({\tilde{\sigma }}(m) = \alpha + \beta \, m\) until reaching a moneyness of \(m = 0.4\). We proceed similarly for the right tail, using the last 5 observations in \(\{\sigma _i^{\tau }\}_{i \in [1, N]}\) and extrapolating from the largest observed moneyness in steps of \(\overline{\varDelta K}(\tau _i)\) until a final moneyness of 1.6. We use the union of observed implied volatilities and thus artificially created implied volatilities as inputs for the kernel regression. Figure 1 presents the tail extrapolation visually for a sample day and maturity.

The last refinement that we apply to the Aït-Sahalia and Lo (1998) methodology is to guarantee that the resulting option-implied volatility surface is consistent with an arbitrage-free asset market by applying the algorithm of Fengler (2009).

Table 2 Cross-validation errors for refinement steps of the kernel regression

Table 2 displays the performance gains that we achieve with each adjustment of the Aït-Sahalia and Lo (1998) method. Clearly, dropping the maturity dimension from the kernel regression leads to the strongest improvement, lowering the RMSE by 77% for the S&P 500 and 92% for the Euro Stoxx 50. Further adding the tail extrapolation leads to an additional 16% reduction in the RMSE of the S&P 500 implied volatility surface (minus 35% for the Euro Stoxx 50). Finally, enforcing no-arbitrage does generally not lead to a higher accuracy of the constructed implied volatility surface. However, as the error figures are not rising, we can reach an arbitrage-free implied volatility surface without having to sacrifice accuracy.

4 On the uniqueness of option-implied information

Option-implied risk and return measures, such as forward-looking risk-neutral variance, skewness or the variance risk premium do have a unique mathematical representation. Yet, the numerical implementation of such measures does often require the integration of option prices across the whole strike spectrum, even though only a subset of these are actually traded in the market. Conditional on a correct volatility surface, Jiang and Tian (2005) show how to implement such an integration numerically. But in real-life applications the problem is that different state-of-the-art approaches for constructing the volatility surface in high resolution across the strike dimension do result in different volatility surfaces. It is the goal of our paper to assess the severity of these differences for a subset of methodologies and to perform a thorough statistical analysis to identify the methodology that comes closest to the unobserved true volatility surface.

Fig. 2
figure 2

Unconditional mean of model-free implied volatility by methodology. This figure presents the sample mean of the Bakshi et al. (2003) model-free implied volatility for S&P 500 options, as a function of maturity and interpolation methodology, together with 95% confidence intervals. The short-term category refers to maturities below 30 days, the mid-term category groups maturities between 30 and 365 days, the long-term category summarizes maturities that are longer than 365 days

Figure 2 summarizes the sample mean of the Bakshi et al. (2003) implied volatilities for S&P 500 options with a remaining maturity of (i) less than 30 days (short-term), (ii) 30–365 days (medium-term) and (iii) more than 365 days (long-term), respectively. Each sample mean is based on one of the four volatility surfaces that we have discussed above. We highlight two insights: First, different volatility surfaces imply different values for a standard risk measure like Bakshi et al. (2003) implied volatility. The average medium-term Bakshi et al. (2003) implied volatility has been 21.0% when using the three-dimensional kernel regression volatility surface and 23.5% when working with the spline-induced volatility surface. That 12% relative spread is statistically significant and obviously large from an economic point of view. Second, while the spline interpolation method and the kernel regression methods tend to produce Bakshi et al. (2003) volatilities that are close to each other for short-term options, the differences build up for maturities larger than 30 days. Noteworthy, the kernel regression methods produce average term structures of Bakshi et al. (2003) implied volatilities that are less steep in the maturity dimension, compared to the other two interpolation methodologies. This results in comparatively low average medium- and long-term estimates for Bakshi et al. (2003) implied volatility.

In contrast to the disperse, yet precisely estimated, sample means of the Bakshi et al. (2003) volatilities, the box plots in Fig. 3 describe the respective full distribution of an interpolation method’s Bakshi et al. (2003) implied volatility. The distributions largely overlap for short- and medium-term maturities. For medium-term maturities and even more strongly for long-term maturities, the downward bias of the kernel regression methods’ estimates becomes visible, which is even more pronounced in the right tail of the distribution.

Fig. 3
figure 3

Distribution characteristics of model-free implied volatility by methodology. This figure presents box plots for the estimates of the Bakshi et al. (2003) model-free implied volatility for S&P 500 options, as a function of maturity and interpolation methodology. The short-term category refers to maturities below 30 days, the mid-term category groups maturities between 30 and 365 days, the long-term category summarizes maturities that are longer than 365 days. The rectangular boxes mark the quartiles of the implied model-free volatility distribution. The solid lines outside of the boxes expand to the 5% and 95% percentiles

Differences in model-free implied variance estimates induce different variance risk premium (VRP) estimates. The details of our VRP calculation are summarized in “Appendix C”. Table 3 summarizes the average annualized VRP for S&P 500 and Euro Stoxx 50 options for different volatility surfaces, respectively. The average annualized VRP for S&P 500 (Euro Stoxx 50) options has been estimated to be between 1.5 and 2.4% (1.9% and 2.9%), depending on the volatility surface. In relative terms, the spline-based volatility surface results on average in a 60% higher S&P 500 VRP estimate relative to the same estimate for the three-dimensional kernel regression. The differences between the average VRP estimates are for nearly all pairs of methods statistically strongly significant. Monthly Sharpe ratios for the VRP range from 0.26 to 0.57 for the S&P 500 and from 0.5 to 0.68 for the Euro Stoxx 50. We highlight that these economically large differences are a direct result of the choice of the inter- and extrapolation method that builds the basis for a volatility surface; the input data and the integration scheme is the same across all methods.

Table 3 Average annualized monthly variance risk premium and Sharpe ratio

Table 4 states the sample mean of model-free option-implied skewness for maturities between 15 and 91 days. Option-implied skewness is calculated as in Bakshi et al. (2003) and reported for different sub-samples.Footnote 5 We split the US data set into a pre-financial-crisis, a crisis and a post-financial-crisis period. We apply the same cuts for the European data set, but further split the post-financial-crisis period into a between-crisis-, an euro-crisis-, and a post-euro-crisis period due to the high impact of the European sovereign debt crisis for European stock markets. The risk-neutral skewness estimates are nearly all significantly different across different volatility surfaces. The estimates with the three-dimensional kernel regression are roughly only half the size when compared to our proposed one-dimensional kernel regression or the spline method. Further, the changes in the average risk-neutral skewness between one sub-period and the next also differ among the volatility surfaces, sometimes even disagreeing on the sign of the change. For example, risk-neutral skewness estimates based on a spline interpolation get less negative on average after the Euro Crisis in Europe, while the estimates based on all other volatility surfaces get more negative.

Table 4 Average Bakshi et al. (2003) skewness by time period and method

In summary, information extracted from option markets is supposed to be unique. But our analysis has documented that this information is sensitive with regard to the volatility surface that a researcher uses. We have shown that risk-neutral model-free estimates for variance and skewness differ by a large margin across different state-of-the-art methods. While we have used the Bakshi et al. (2003) moments for explanatory purpose, our findings hold more generally for any quantity that is extracted from the aggregation of option prices along the strike range.

5 Assessing the accuracy of a volatility surface

Here, we assess the relative advantages and disadvantages of different state-of-the-art volatility surfaces. Our previous findings have documented that standard option-implied measures differ across volatility surfaces. As these measures are deterministic functions of the volatility surface, it is natural that the most accurate volatility surface does imply the most accurate option-implied risk measures.

At the same time, accuracy is only one of multiple practical considerations when constructing an implied volatility surface. It is, for example, well thinkable to accept a lower accuracy in favor of a more informative representation of the volatility surface. Especially parametric models can provide such representations. Depending on the model, single parameters can be interpreted directly and serve as measures that express the situation at the option market in lower dimension. For example, the parameters of the SVI model of Gatheral (2004) inform about the level of the implied volatility smile, its rotation and how wide it is. Parametrization also allows to easily share an implied volatility surface, as it can be fully reconstructed from the relatively few parameters. The model for the volatility smile allows to extrapolate beyond observed option prices in a manner that is consistent with the central part of the modeled volatility smile. No-arbitrage constraints can be incorporated into the construction of the implied volatility surface at the parameter estimation phase already (Damghani and Kos 2013; Damghani 2015). These arguments speak in favor of a parametric representation of the implied volatility surface. However in this study, we explore how accurate different volatility surfaces capture option market information. The result of our analysis can then serve as an important input, next to the previously mentioned concerns, for the choice of a construction method in practice.

One might also be willing to deliberately sacrifice some accuracy in the representation of the observed implied volatilities in favor of a smoother volatility surface (Jackwerth 2000; Jackwerth and Rubinstein 1996). This is especially true, if the noise in the observations is expected to be large enough to not be fully canceled out by the interpolation and smoothing method. In that case, the constructed volatility surface would show spikes or bumps, that would decrease its smoothness. On the other hand, a very smooth volatility surface might plane out important features of the observations and thus introduce biases.Footnote 6 Therefore, we are going to implement a thorough investigation on the statistical accuracy of popular volatility surfaces and compare them with respect to their smoothness.

We follow a rich machine learning literature that assesses the accuracy of a model based on leave-one-out cross-validation (Geisser 1993; Kohavi 1995). The advantage of this cross-validation approach for our study is that every method that we use to construct an option-implied volatility surface is evaluated with data that was not used to construct the surface. This allows us to detect over- and under-fitting.

We evaluate the statistical quality of each volatility surface with end-of-day and intraday data for S&P 500 and Euro Stoxx 50 options. End-of-day data is characterized by a rich panel of option prices for different strikes and maturities, that all refer to the same point of time. We call this to be the data-rich environment. In contrast, intraday data is characterized by a limited amount of observed trade prices in a given time interval. We therefore call the intraday application to be a data-poor environment. The intraday set-up becomes increasingly data-poor as the considered time interval for pooling trade observations shrinks and thus the time resolution increases.Footnote 7

The highest possible time resolution of a method is bound by the minimum amount of option trades that a method requires for constructing the option-implied volatility surface. Technically, the Gram–Charlier expansion requires only 3 observed option prices for distinct strikes at the same maturity. Price observations at 3 different strikes are also the theoretical minimum for capturing the key characteristics level, slope and curvature of the implied volatility smile. We undertake four independent intraday analyses, namely relying on 3, 4, 5 and 10 observations per maturity.

For the end-of-day data set, the cross-validation works as follows: Given price observations of a day, we in turn leave out a single observation and calculate the methodology-implied estimate for that observation. We repeat this procedure for each observed option price. The data-poor set-up with 3, 4, 5 and 10 price observations per maturity is treated similarly: For each new transaction price, we use the corresponding 3, 4, 5 or 10 preceding transaction prices with differing strike prices for the same maturity in order to create an estimate for the next transaction price. This approach is basically assessing out-of-sample how well a volatility surface predicts future option prices.

Given a set of N evaluation samples with Black-Scholes implied volatility observations \(\sigma _i^{BS}, i \in \{1,...,N\}\), our primary evaluation measure is the root mean squared error (RMSE), that arises when comparing the respective method’s Black-Scholes-implied volatility estimator \({\hat{\sigma }}_i^{BS}\) with the left out observation \(\sigma _i^{BS}\):

$$\begin{aligned} RMSE = \sqrt{\frac{1}{N} \sum _{i=1}^N (\sigma _i^{BS} - {\hat{\sigma }}_i^{BS})^2}. \end{aligned}$$
(5)

A detailed discussion of the relationship between the (root) mean squared error as evaluation criterion and the desire to obtain a smooth volatility surface can be found in “Appendix D”. By squaring the errors, the RMSE penalizes large deviations of the constructed volatility surface from the observed implied volatilities more strongly than small deviations. However, as discussed above, one might be willing to accept occasional large errors in favor of a smoother volatility surface. This is especially true, if one expects such large errors to be due to outliers, that are not representative of the true unobserved volatility surface. For this reason, we also calculate the mean absolute error (MAE),

$$\begin{aligned} MAE = \frac{1}{N} \sum _{i=1}^N \left| \sigma _i^{BS} - {\hat{\sigma }}_i^{BS} \right| , \end{aligned}$$
(6)

which is less responsive to such occasional large errors and is still comparably low, if these errors only occur seldom and the remaining fit is good. For both error measures, we exclude observations with an implied volatility above 10.Footnote 8

We also compare the end-of-day volatility surfaces with respect to their smoothness directly. Following Jackwerth (2000), we measure smoothness as the sum of squared second derivatives of the implied volatility surface along the moneyness dimension. More precisely, for each day t and maturity \(\tau \), we consider a moneyness range of \([1 - k \, s \, \sqrt{\tau }, 1 + k \, s \, \sqrt{\tau }]\), where s is an estimate for the unconditional annual volatility of the underlying and k is a fix multiple. We discretize this range in steps of \(\varDelta = \frac{2 k s \sqrt{\tau }}{99}\) and construct the volatility smile \(\{{\hat{\sigma }}_{j,t,\tau }^{BS}\}_{j \in [0, 100]}\) for the grid points of that day and maturity with each method. Our measure for the smoothness of the smile is then calculated as

$$\begin{aligned} S_{t,\tau } = \sum _{j=1}^{99} \left( \frac{{\hat{\sigma }}_{j-1,t,\tau }^{BS} - 2 {\hat{\sigma }}_{j,t,\tau }^{BS} + {\hat{\sigma }}_{j+1,t,\tau }^{BS}}{\varDelta ^2} \right) . \end{aligned}$$
(7)
Table 5 End-of-day cross-validation RMSE

Finally, we compute the mean of that smoothness measure across all days and maturities and take the square root for better readability. In that, we exclude the 1% of the smiles with the highest and the 1% with the lowest smoothness measure to mitigate the impact of the tails of the distribution of smoothness figures on the mean estimates.

6 Findings

This section summarizes key findings of our empirical assessment of the statistical quality of different state-of-the-art volatility surfaces. We start with the end-of-day analysis for S&P 500 and Euro Stoxx 50 options. This section ends with the findings for the intraday analysis.

6.1 End-of-day: data-rich environment

The aggregated RMSEs for the end-of-day, data-rich, environment are summarized in Table 5, the respective MAE error figures can be found in Table 9 in “Appendix E”. For S&P 500 options we find that the volatility surface from the one-dimensional kernel regression produces the lowest RMSE (0.0092) and MAE (0.0019). The spline interpolation induced volatility surface ranks second with an overall RMSE (MAE) that is 118% (91%) higher. The three-dimensional kernel regression ranks third place, with an RMSE (MAE) that is roughly 6 (12) times larger, relative to the one-dimensional kernel regression. The volatility surface of the Gram–Charlier expansion turns out to be the least accurate, producing a RMSE (MAE) that is more than 9 (23) times larger than the winning volatility surface.

Looking at the RMSE results for the Euro Stoxx 50 confirms that the volatility surface that is generated by the one-dimensional kernel regression is the most accurate one, followed by the spline-based volatility surface, whose overall RMSE is 37% higher. However, the ranking is reversed with respect to the MAE, where the spline method performs best and the kernel regression shows a 29% higher error. This reversal is the result of a downward bias of the one-dimensional kernel regression in deep out-of-the-money medium-term options with a moneyness below 0.5. If these options were to be excluded from the error calculation, the one-dimensional kernel regression and the spline method would produce basically the same MAE for Euro Stoxx 50 data. Finally, the three-dimensional kernel regression and the Gram–Charlier expansion produce RMSEs (MAEs) that are 14 (25) and 12 (24) times higher, respectively.

We now continue to report the RMSE for different regions of the option-implied volatility surface. We split the option-implied volatility surface along the maturity and moneyness dimension. Options with a maturity of less than 30 calendar days are considered short-term, options with a maturity between 30 and 365 calendar days are considered medium-term, while options with a maturity of more than 1 year are called long-term. Along the moneyness axis, we label a moneyness of 0.95 to 1.05 as at-the-money (ATM), whereas the ‘left tail’ (‘right tail’) is characterized by a moneyness of below (above) 0.95 (1.05). In combination, these splits yield nine different regions of the option-implied volatility surface, for which we calculate the RMSE and MAE of each method.

Looking at panels (a) and (b) of Tables 5 and 9 highlights that across all volatility surfaces, short-term options and left tail options produce the highest errors. For the S&P 500, the one-dimensional kernel regression performs best in most sections of the implied volatility surface with respect to the RMSE and in all sections with respect to the MAE. Spline interpolation tends to outperform the one-dimensional kernel regression for medium- and long-term options for the Euro Stoxx 50, though, which is more pronounced in the RMSE results than in the MAE results. The three-dimensional kernel regression and the Gram-Charlier expansion do both show severe difficulties in capturing the left and the right tail of the surface. The problem of the Gram–Charlier volatility surface is that the parametric risk-neutral density approximation turns out to be insufficient for capturing market information in the tails. We identify that the relative weakness of the three-dimensional kernel regression is that it only considers options with a delta of 0.2–0.8, which ignores market information about the tails.

Table 6 Smoothness of the volatility surface by method

We proceed by comparing the smoothness of the volatility surfaces. Table 6 displays our smoothness measures, as defined in Eq. 7. A smooth volatility surface has low second derivatives and thus a low smoothness measure. Conversely, a high smoothness measure is an indicator for a more curved volatility surface. The volatility surface of the three-dimensional kernel regression is the least smooth in our tests. On the other end, the one-dimensional kernel regression appears to be smoother than most of the alternatives. For the spline interpolation, there is an interesting divergence in the smoothness measure between its comparably rough S&P 500 volatility surface and its very smooth Euro Stoxx 50 surface. This goes hand in hand with the low errors of the spline interpolation in constructing the Euro Stoxx 50 surface. It seems like the spline interpolation is not able to cancel out all noise in the S&P 500 data, which produces a more curved volatility surface and higher fitting errors. At the same time, a smooth volatility surface does not necessarily imply low errors, as can be seen from the S&P 500 results for the Gram–Charlier expansion. That method might produce a smooth volatility surface, though it does not accurately reflect the information in observed option prices.

In short summary, the cross-validation of the data-rich environment recommends to use a volatility surface that was constructed based on the one-dimensional kernel regression. The spline interpolation is a good alternative for Euro Stoxx 50 data, though appears to not be able to cancel out all noise in the S&P 500 data. The volatility surface of the Gram–Charlier expansion should only be applied if one is interested in option-implied information from the at-the-money region and should be avoided when inferring conclusions about the tails. The three-dimensional kernel regression produces RMSEs that, relative to the one-dimensional kernel regression, are roughly 6 times larger for S&P 500 options and roughly 14 times larger for Euro Stoxx 50 options.

6.2 Intraday: data-poor environment

Here, we summarize key findings for the intraday analysis. We start with the highest time resolution (3 trades), followed by interpolations based on 4, 5 and 10 trades for distinct strikes.

Table 7 Intraday cross-validation RMSE

The findings from Table 7 (see Table 10 in “Appendix E” for MAE results) highlight that the one-dimensional kernel regression produces the most accurate overall volatility surface. The three-dimensional kernel regression is the second most accurate surface, though the performance differential is not as large as for the end-of-day set-up. In our sample, we find that the spline-based volatility surface suffers from frequent outliers, which increases the RMSE enormously. This is not surprising, because the spline interpolation is prone to over-fitting when it is applied to very few observations. Figure 4 visualizes the potential problem of over-fitting when constructing the spline-based volatility surface with 5 or less data points. As the MAE is less responsive to occasional large errors, which are a result of over-fitting, the spline interpolation performs better with respect to the MAE than with respect to the RMSE. It even shows a slightly lower MAE than the one-dimensional kernel regression for the Euro Stoxx 50 intraday set-up with 3 trades. The volatility surface of the Gram–Charlier expansion ranks always in the back, though with a much smaller error differential than in the end-of-day set-up. There does not appear to occur any over-fitting in the Gram–Charlier expansion, however in all of our tests, the general fit seems to be worse than for the kernel regression methods.

Fig. 4
figure 4

Sample for over-fitting in the spline interpolation. This figure visualizes the severity of the over-fitting in the spline interpolation methodology for a particular data point of the Euro Stoxx 50 sample, namely 30-day options on May 16th, 2012 (01:35 p.m.). The dots with a moneyness between 1.0 and 1.2 represent the 5 previous trade observations based on which the spline interpolation constructs the spline, which is depicted by the solid line. The dot with a moneyness around 0.6 marks the next trade observation

Comparing different sections of the implied volatility surface reveals that it is primarily the moneyness axis, and more precisely the left tail of shorter-term options that causes problems to all volatility surfaces. The volatility surface from the one-dimensional kernel regression generally copes best with these difficulties. With respect to both error measures, the volatility surface of the one-dimensional kernel regression implies the highest statistical accuracy across most regions of the volatility surface. Relatively speaking, that volatility surface is especially precise for longer maturity options and for left tail options of all maturities.

We now report the main findings for the intraday analysis that builds the volatility surface with 4, 5 and 10 price observations per maturity, respectively. Table 8 summarizes the results. As one would expect, adding more observations as input for the interpolations, improves the statistical accuracy for all methods. Clearly, the spline-based volatility surface benefits most, since the higher amount of observations reduces the tendency of over-fitting. Its RMSE differential towards the best method is falling rapidly as the volatility surface is based on more price observations. With the decreasing occurrence of over-fitting, spline interpolation quickly becomes the leading method with respect to the MAE and even outperforms the other approaches with respect to the RMSE in Euro Stoxx 50 options, when using 10 trades for the volatility surface. We conclude that the spline interpolation is a good alternative for the one-dimensional kernel regression in intraday set-ups, if an intermediate amount of price obervations is used and if the occasional occurrence of an over-fitted volatility smile is acceptable.

7 Conclusion

Our paper makes two contributions to the literature. First, we are the first to document that option-implied information, such as well-known forward-looking measures for variance, skewness and the variance risk premium, are sensitive to the choice of the underlying volatility surface. These findings hold more generally for any quantity that is extracted from the aggregation of option prices along the strike range.

Table 8 Intraday cross-validation errors for varying amount of trades used

Second, we implement a thorough statistical assessment of the accuracy of common parametric, semi-parametric and non-parametric approaches that the literature has entertained for constructing volatility surfaces at the end-of-day and intraday frequency. The methods under consideration are the Gram–Charlier expansion, the spline interpolation of Figlewski (2008), the three-dimensional kernel regression of OptionMetrics (2016) and the one-dimensional kernel regression with a linear tail extrapolation for the end-of-day setting. We have recorded the root mean squared error (RMSE) and mean absolute error (MAE) based on a leave-one-out cross-validation for each method and have compared the smoothness of the constructed end-of-day volatility surfaces directly. The test assets are S&P 500 and Euro Stoxx 50 options at the daily and intraday frequency over the past 14 years.

The result of our analysis is that the volatility surface that is constructed with the one-dimensional kernel regression is generally the most accurate one for end-of-day and intraday options on the S&P 500 and the Euro Stoxx 50. We recommend to use that volatility surface for extracting option-implied information at the daily and high-resolution intraday frequency. However, if an intermediate amount of observations is available, spline interpolation might be a preferable alternative, despite the occasional occurrence of over-fitting. The parametric Gram–Charlier volatility surface is in many cases too restrictive to approximate the true risk-neutral distribution and thus the true volatility surface accurately. The three-dimensional kernel regression of OptionMetrics (2016) performs comparatively well in the intraday analysis, but shows shortcomings in capturing the left tail of the option-implied volatility surface, especially at the end-of-day frequency.