1 Introduction

Concerns about cybersecurity threats are growing across all sectors of the global economy, as cyber risks have increased and cyber criminals have become progressively more dangerous.

In recent years, many stealthy and sophisticated cyber attacks have targeted public and private sector organizations. The annual cost of cybercrime study conducted by the Ponemon InstituteFootnote 1 confirms that, combined with the expanding threat landscape, organizations are noticing a steady rise in the number of security breaches: The average number has moved from 130 in 2017 to 145 in 2018 (+11% over the last year, +67% over the last 5 years). The impact of these cyber attacks on organizations, industries and society is substantial, as the total cost of cybercrime for each company has increased from $11.7 million in 2017 to a new high of $13.0 million in 2018 (+12% over the last year, +72% over the last 5 years). The 2018 study reports that the global average cost of a data breach is up 6.4% over the previous year to $3.86 million. The average cost for each lost or stolen record containing sensitive and confidential information has also increased by 4.8% year over year to $148. According to the Online Trust Alliance,Footnote 2 the number of cyber attacks worldwide doubled in 2017 to 160,000, although endemic underreporting means that the true figure could be as high as 350,000.

Despite the improvements in security countermeasures and practices, the statistics presented above highlight how cyber insurance represents an important tool for risk managers to mitigate the economic impact of cyber attacks. The demand for cyber insurance is expected to grow strongly as people and companies become aware of the economic risk behind cyber attacks. However, the market for cyber insurance is undersized, mainly because insurance and reinsurance companies are still unprepared to offer coverage for such risks. As KPMG highlights in one of its reportsFootnote 3 on cyber insurance, insurers still need to improve their modeling capabilities with respect to these specific types of risk.

With the availability of new databases, academic research has started offering its contribution to understanding a particular class of cyber risk, namely data breaches. The literature on cyber risk and information security is rich in papers from the information technology area, while less work has been proposed in economics, finance and insurance. A comprehensive reference for an overview of the latter is Xu et al. (2018). In that study, the authors discuss a statistical analysis of a breach incident dataset obtained from the Privacy Rights ClearinghouseFootnote 4 and use stochastic processes to fit and predict inter-arrival times and breach sizes. The work includes a detailed review of prior contributions on the topic: Among others, it is worth mentioning Eling and Loperfido (2017), which analyzes the PRC Database with some actuarial insights, and the related studies on data breach statistical evaluations such as Maillart and Sornette (2010); Edwards et al. (2016); Wheatley et al. (2016, 2019, 2020). The PRC Database is also used in Farkas et al. (2019): The authors investigate the heterogeneity of the reported cyber claims using regression trees. The economic value of cyber risk is discussed in Eling and Wirfs (2019), where the authors focus on cyber losses from an operational risk database and analyze the dataset with methods from statistics and actuarial science. As far as cyber insurance is concerned, a review of the available scientific approaches for the analysis of the cyber insurance market is Marotta et al. (2017), where the authors offer insights from both market and scientific perspectives.

In this paper, we go a step further in the understanding of data breaches by providing a dynamic analysis in which we find a causal relation between the intensity of data stolen and some metrics of the cryptocurrency market. Our original conjecture is as follows: If hackers perform data attacks to make a profit, they must somehow monetize the attack, and some cryptocurrencies offer the quickest and most anonymous way to do so. To test our postulate, we perform, on two distinct datasets of data breaches, a rigorous cointegration analysis between the daily number of stolen data, the daily Bitcoin price and the daily number of transactions in Bitcoin. In addition, we run Granger causality tests between the three variables. In both datasets, we find strong empirical evidence of a causal relationship between the number of data breaches and the Bitcoin-related variables, both in the short run and in the long run.

To the best of our knowledge, we provide for the first time a set of easily measurable variables that explain data breaches. Thus, our findings offer new insights into the statistical estimation and forecasts of data breaches. This might guide insurers and reinsurers in the process of building new products that offer protection against such kinds of risk.

In the remainder of the paper, we proceed as follows: In Sect. 2, we summarize the methodology used. In Sect. 3, we describe the datasets used and present the results of the cointegration analysis and Granger causality tests. To quantify the impact of Bitcoin-related variables on data breaches, we also perform an impulse response analysis and a variance decomposition of the forecasting errors. In Sect. 4, we conclude by highlighting our results and suggesting new directions of research.

2 Methodology: cointegration analysis

Cointegration analysis has been widely used in finance and economics. Among other applications, it has been employed to investigate the lead–lag relationship between spot and futures prices (see for instance Tse 1995) and the integration and efficiency of international bond markets (Mills and Mills 1991; Ciner 2007). As for applications involving economic data, many authors have relied on cointegration analysis to test the purchasing power parity (Pippenger 1993; Chen 1995) or to examine the expectations theory of the term structure of interest rates (see Campbell and Shiller 1987; Shea 1992, among others).

A d-dimensional time series \(\mathbf {Y}_t\) is said to be cointegrated of order (a, b) if each series is integrated of order a,Footnote 5 i.e., each series becomes stationary after taking first differences a times, and there exists a linear combination of the d variables, \(\tilde{\mathbf{Y }}_t=\varvec{\beta }'\mathbf {Y}_t\) with \(\varvec{\beta }\) a nonzero \(d \times 1\) vector, such that \(\tilde{\mathbf{Y }}_t\) is integrated of order \(a-b\). As with many economic time series, we are interested in the case in which \(a=b=1\), meaning that each of the one-dimensional components of \(\varvec{Y}_t\) has a unit root (i.e., it is integrated of order one), while their linear combination \(\tilde{\mathbf{Y }}_t\) is instead I(0). Hence, the starting point of cointegration analysis consists in establishing the order of integration of all the series of interest.
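To make the definition concrete, the following minimal sketch (illustrative only, not part of our empirical analysis) simulates two I(1) series sharing a common stochastic trend: each series has a unit root, while the linear combination obtained with \(\varvec{\beta }=(1,-1)'\) is stationary, so the pair is cointegrated of order (1, 1). Library and variable names are our own choices.

```python
# Illustrative simulation (not from the paper): two I(1) series sharing a
# common stochastic trend, so that beta = (1, -1)' yields a stationary
# linear combination, i.e., the pair is cointegrated of order (1, 1).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(size=2000))      # common random walk
y1 = trend + rng.normal(size=2000)            # each component is I(1)
y2 = trend + rng.normal(size=2000)

print(adfuller(y1)[1], adfuller(y2)[1])       # large p-values: unit root not rejected
print(adfuller(y1 - y2)[1])                   # small p-value: the combination is I(0)
```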

2.1 Unit root tests

The first step needed in a cointegration analysis involves testing if all the time series in \(\mathbf {Y}_t\) are integrated of order one. To this end, besides the usual augmented Dickey–Fuller (ADF) tests, we consider the ADF-GLS test of Elliott et al. (1996). The authors proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term: These are estimated first via generalized least squares, and in a second stage an ADF regression is performed using the GLS residuals. The usual ADF tests are based on the t-statistic on \(\phi \) in the following regression:

$$\begin{aligned} \varDelta y_{i,t} = \mu _t + \phi y_{i,t-1} + \sum _{j=1}^p \gamma _j\varDelta y_{i,t-j} + \epsilon _{i,t}, \end{aligned}$$
(1)

where \(y_{i,t}\) is the ith component of \(\mathbf {Y}_t\). The null hypothesis of a unit root is \(\phi =0\), tested against the alternative \(\phi <0\). Therefore, large negative values of the test statistic lead to the rejection of the null. If all the components of \(\mathbf {Y}_t\) are found to be I(1), then the econometrician can move to the next step, i.e., the Johansen procedure. Its aim is to establish whether \(\mathbf {Y}_t\) is cointegrated and, if this is the case, how many cointegrating relations exist.
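As a hedged illustration of this first step, the sketch below runs both the standard ADF test (statsmodels) and the ADF-GLS test of Elliott et al. (1996), as implemented in the arch package, on the levels and first differences of a series; the input series, trend specification and lag-selection criterion are illustrative assumptions, not our actual configuration.

```python
# A hedged sketch of the unit-root step: standard ADF (statsmodels) and
# ADF-GLS (arch) on the levels and first differences of a series. The input
# series, trend specification and lag selection are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from arch.unitroot import DFGLS

def unit_root_report(series: pd.Series, name: str) -> None:
    """Run ADF and ADF-GLS on the levels and on the first difference."""
    for label, y in [("level", series.dropna()),
                     ("first difference", series.diff().dropna())]:
        adf_stat, adf_pval, *_ = adfuller(y, regression="c", autolag="BIC")
        dfgls = DFGLS(y, trend="c")   # Elliott-Rothenberg-Stock ADF-GLS test
        print(f"{name} ({label}): ADF = {adf_stat:.3f} (p = {adf_pval:.3f}), "
              f"ADF-GLS = {dfgls.stat:.3f} (p = {dfgls.pvalue:.3f})")

# Illustrative call on a simulated random walk, which is I(1) by construction
rng = np.random.default_rng(1)
unit_root_report(pd.Series(np.cumsum(rng.normal(size=1500))), "random walk")
```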

2.2 The Johansen procedure

A general vector autoregression (VAR) model with deterministic part \(\varvec{\mu }_t\) of the form:

$$\begin{aligned} \mathbf {Y}_t = \varvec{\mu }_t+\varvec{\varPi }_1 \mathbf {Y}_{t-1} + \cdots +\varvec{\varPi }_k \mathbf {Y}_{t-k} + \varvec{\varepsilon }_t , \quad t = 1, \ldots , T, \end{aligned}$$
(2)

can be rewritten using the following vector error-correction model (VECM) specificationFootnote 6:

$$\begin{aligned} \varDelta \mathbf {Y}_t = \varvec{\mu }_t+ \varvec{\varPi } \mathbf {Y}_{t-1}+ \varvec{\varGamma }_1 \varDelta \mathbf {Y}_{t-1} + \cdots + \varvec{\varGamma }_{k-1} \varDelta \mathbf {Y}_{t-k+1} + \varvec{\varepsilon }_t \end{aligned}$$
(3)

where

$$\begin{aligned} \varvec{\varGamma }_i= & {} - (\varvec{\varPi }_{i+1} + \cdots + \varvec{\varPi }_k), \quad i = 1, \ldots , k-1,\\ \varvec{\varPi }= & {} -(\mathbf {I} - \varvec{\varPi }_1 - \cdots - \varvec{\varPi }_k), \end{aligned}$$

and \(\varDelta \) is the first difference operator, i.e., \(\varDelta \mathbf {Y}_t = \mathbf {Y}_t-\mathbf {Y}_{t-1}\).

We implement the tests developed by Johansen (1991) to test the hypothesis that \(\mathbf {Y}_t\) is cointegrated of order (1, 1). Such a hypothesis involves r, the rank of \(\varvec{\varPi }\). If \(1\le r\le d-1\), one can write \(\varvec{\varPi }=\varvec{\alpha }\varvec{\beta }'\) where \(\varvec{\alpha }\) and \(\varvec{\beta }\) are \(d \times r\) matrices. The matrix \(\varvec{\beta }\) contains r linear cointegration parameter vectors, whereas \(\varvec{\alpha }\) is a matrix of error-correction parameters (the so-called loadings). The maximum likelihood estimation proceeds as follows. First, one runs the OLS regression of \(\varDelta \mathbf {Y}_t\) on \(\varDelta \mathbf {Y}_{t-1}\), \(\ldots \), \(\varDelta \mathbf {Y}_{t-k+1}\) and a constant, and denotes by \(\varvec{\hat{\epsilon }}_{0t}\) the residuals. Similarly, one runs the OLS regression of \(\mathbf {Y}_{t-1}\) on \(\varDelta \mathbf {Y}_{t-1}\), \(\ldots \), \(\varDelta \mathbf {Y}_{t-k+1}\) and a constant, and denotes by \(\varvec{\hat{\epsilon }}_{1t}\) the residuals. Given the residuals, it is possible to calculate for \(i,j=0,1\) the matrices \(\varvec{S}_{ij}=T^{-1}\sum _{t=1}^T \varvec{\hat{\epsilon }}_{it} \varvec{\hat{\epsilon }}_{jt}'\). Let \(\hat{\lambda }_1>\cdots >\hat{\lambda }_d\) be the eigenvalues obtained from solving the eigenvalue system \(\left| \lambda \varvec{S}_{11}-\varvec{S}_{10} \varvec{S}_{00}^{-1}\varvec{S}_{01}\right| =0\), and \((\varvec{\hat{\psi }}_{1},\ldots ,\varvec{\hat{\psi }}_{d})\) the corresponding eigenvectors. The estimate of \(\varvec{\beta }\), \(\hat{\varvec{\beta }}\), is obtained as the juxtaposition of the eigenvectors associated with the r largest eigenvalues, and the estimate of \(\varvec{\alpha }\) is obtained as \(\hat{\varvec{\alpha }}=\varvec{S}_{01}\hat{\varvec{\beta }}\). Two maximum likelihood tests proposed by Johansen, the maximal eigenvalue test and the trace test, can then be used to determine the number of cointegration vectors. The statistic of the maximal eigenvalue test for the null hypothesis of r cointegration vectors against the alternative of \(r+1\) cointegration vectors is \(\hat{\lambda }_{\text {max}}=-T\log (1-\hat{\lambda }_{r+1})\). The trace test statistic for the null hypothesis of at most r cointegration vectors is \(\hat{\lambda }_{\text {trace}}=-T\sum _{i=r+1}^d\log (1-\hat{\lambda }_{i})\). If the results are consistent with the hypothesis of at least one cointegration vector, one then uses the maximum likelihood method to test hypotheses regarding the restrictions on \(\varvec{\beta }\).
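The sketch below shows how the trace and maximal eigenvalue tests can be computed with statsmodels' coint_johansen; the deterministic specification, the lag order and the simulated input are assumptions made for illustration and do not reproduce our estimates.

```python
# A minimal sketch of the Johansen trace and maximal-eigenvalue tests using
# statsmodels' coint_johansen. The deterministic specification (det_order=0,
# a constant), the lag order and the simulated input are assumptions for
# illustration only.
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_tests(Y, k=8):
    # k_ar_diff = k - 1 lagged differences corresponds to a VAR(k) in levels.
    res = coint_johansen(Y, det_order=0, k_ar_diff=k - 1)
    for r in range(Y.shape[1]):
        # trace tests H0: rank <= r; max-eigenvalue tests H0: rank = r vs r + 1
        print(f"r = {r}: trace = {res.lr1[r]:.2f} (95% cv {res.cvt[r, 1]:.2f}), "
              f"max-eig = {res.lr2[r]:.2f} (95% cv {res.cvm[r, 1]:.2f})")
    return res

# Illustrative call on simulated data in which the first two series share a trend
rng = np.random.default_rng(2)
trend = np.cumsum(rng.normal(size=2000))
Y = np.column_stack([trend + rng.normal(size=2000),
                     trend + rng.normal(size=2000),
                     np.cumsum(rng.normal(size=2000))])
johansen_tests(Y)
```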

3 The relation between data breaches and Bitcoin metrics

3.1 Data

In this paper, we look for short-term and/or long-term relations between data breaches and Bitcoin-related variables. More specifically, we perform two distinct analyses based on two publicly available databases of data breaches.

The first database is taken from the Chronology of Data Breaches provided by the Privacy Rights ClearinghouseFootnote 7 (PRC). The PRC dataset is publicly available and constantly updated on the PRC Web site and has been used in other recent investigations (see for instance, Eling and Loperfido 2017; Maillart and Sornette 2010; Edwards et al. 2016; Wheatley et al. 2016; Farkas et al. 2019; Wheatley et al. 2019, 2020).

The second dataset is obtained from the Breach Level Index (BLI) Data Breach Database, a centralized, global database of data breaches with calculations of their severity based on multiple factors. The BLI tracks publicly disclosed breaches and also allows organizations to perform their own risk assessment: On the basis of a few simple inputs, it calculates their risk score and overall breach severity level, and summarizes possible actions to reduce the risk score. The dataset has been downloaded from the Web site of Gemalto, part of the Thales Group, one of the world leaders in digital security.Footnote 8

We note that the databases do not necessarily contain all hacking breach events, because there may be unreported ones. The exposure to this type of risk is not easy to track, since the population of potential victims that would report to such registers is not stable through time, or at least not known, in contrast to the information that comes from an insurer, which might have a clearer view of the exposure. Many organizations are not aware that they have been breached, or they are not required to report it under the applicable reporting laws. The PRC Chronology is limited to breaches reported in the USA. If a data breach affects individuals in other countries, it is included only if individuals in the USA are also affected. The data contain only the number of records affected by data breaches and do not include financial losses.

The PRC database is organized by industry and type of attack. The BLI database also includes the country affected by the breach.

As for the Bitcoin-related variables, we are mainly interested in the daily price and the daily number of transactions of Bitcoin. Our data source is DataHub,Footnote 9 a project by Datopian and Open Knowledge International that provides publicly available high-quality datasets. For the cryptocurrency, we focus on the historical prices (USD) and on the number of transactions occurring on the public blockchain during a given day.

The period of time we refer to, chosen considering the dimensions of the datasets and in order to make them comparable, goes from January 1, 2013, to December 31, 2018. The work aims at investigating the idea that some breaches are exploited by criminal organizations to make money: For this reason, the analysis focuses on malicious breaches related to hackers, while negligent breaches and the other sub-categories of malicious breaches included in the databases (i.e., insider, payment card fraud, ransomware, accidental unknown) have been ignored.

When researchers apply cointegration, both the sample size and the time span are relevant. Hakkio and Rush (1991) argue that cointegration is a long-run concept, and hence long spans of data are needed for cointegration tests to have power, and that gains from using more frequently sampled observations while keeping the same time span are “more apparent than real.” The Monte Carlo study of Zhou (2001), while confirming the importance of the time span, reveals that increasing the data frequency may yield substantial power gains. Since the considered time series on data breaches go back only to 2013 and are available at most at a daily frequency, in the present paper we use the longest span at the highest possible frequency. We believe that the time span is long enough to obtain reliable results regarding the short-term and long-term relations between the variables of interest.

The dynamic behavior of data breaches is represented by integer-valued time series displaying an unusual pattern which resembles a point process. Figure 1 plots the two time series generated from the datasets used. The figure clearly shows the impulsive nature intrinsic to time series of this kind. In such cases, standard cointegration analysis cannot be applied directly to the time series.Footnote 10 Our empirical strategy to overcome this problem is to extrapolate from the original dataset a new latent time series that generated the observed pattern of data breaches. Then, we perform the cointegration analysis using the extrapolated latent time series instead of the observed data. This approach is quite common in cointegration analysis. We refer to Niu and Melenberg (2014) for a cointegration analysis that uses latent factors.

We report additional details in the subsequent subsection.

3.2 Time series of counts and their intensities

Integer-valued GARCH models (henceforth INGARCH) constitute a popular class of models for time series of counts. Although the name might suggest some sort of affinity with the well-known GARCH models, INGARCHs are autoregressive moving average processes constructed to model the dynamics of phenomena that are discrete. An INGARCH model allows the conditional expected value of a discrete random variable (or some transformation of it) that models a countable phenomenon to depend on its previous values and on previous observations of the phenomenon itself. As a general discussion of such processes is far beyond the scope of the present paper, we restrict ourselves to the presentation of the model used to extract the latent time series and refer to Weiß (2018) for an outstanding introduction to INGARCH models.

Fig. 1 Plot of the observed time series. Left panel: PRC dataset. Right panel: BLI dataset. All plots are in logarithmic scale for visualization purposes

Let \(B_t\) be the observed breach size at time t. We assume that the conditional distribution \(B_t|(B_0,\ldots ,B_{t-1})\) of the observed breach size given the previous realizations follows a Negative Binomial distribution \(NB\left( s,p_t = \frac{\mu _t}{s + \mu _t}\right) \) with probability mass

$$\begin{aligned} P(B_t = k) = \frac{\varGamma (s+k)}{\varGamma (k+1)\varGamma (s)} \left( \frac{\mu _t}{s + \mu _t}\right) ^k \left( \frac{s}{s + \mu _t}\right) ^s, \end{aligned}$$

so that we have \(E\left( B_t|(B_0,\ldots ,B_{t-1})\right) =\mu _t\) and \(Var\left( B_t|(B_0,\ldots ,B_{t-1})\right) = \mu _t + \frac{\mu _t^2}{s}\). In addition, we allow \(\lambda _t = \log (\mu _t)\) to follow an ARMA process of order (p, q) of the following type:

$$\begin{aligned} \lambda _t = a_0 + \sum _{i=1}^p a_i \log \left( B_{t-i}+1\right) + \sum _{j=1}^q b_j \lambda _{t-j}. \end{aligned}$$

This model first appeared in Fokianos and Tjøstheim (2011). It is a generalization of the basic INGARCH model that allows for both positive and negative serial correlation. The choice of a logarithmic scale for the observed time series is needed to ensure the positivity of the conditional expectation \(\mu _t\). Fokianos and Tjøstheim (2011) also show that adding a constant to the logarithmic transformation of the time series does not alter the estimation process. Although the model was originally proposed in association with a Poisson distribution for the observed time series, the strong over-dispersion present in the breach data motivates our choice of a negative binomial distribution.

Fig. 2 Estimated logarithmic conditional expectations of the INGARCH model. Left panel: PRC dataset. Right panel: BLI dataset

We use maximum likelihood to fit the INGARCH model, with the main goal of extrapolating the time series of (logarithmic) conditional expected values \(\lambda _t\). We plot the resulting extrapolated time series in Fig. 2 and observe that the dynamic patterns of the latent time series are well suited for standard cointegration analysis. Thus, in what follows we investigate cointegration using \(\lambda _t\) instead of the original dataset. Accordingly, we specify the vector in (2) and (3) as \(\mathbf {Y}_t = \left( C_t,P_t,\lambda ^h_t\right) '\), where \(C_t\) and \(P_t\) are, respectively, the logarithm of the daily number of transactions in Bitcoin and the daily Bitcoin price, and \(\lambda ^h_t\) is the logarithmic conditional expectation estimated from dataset \(h=\text {PRC, BLI}\).
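A minimal sketch of this fitting step is given below, assuming an INGARCH(1, 1) specification and using scipy for the maximum likelihood optimization; the input series, the starting values and the (1, 1) order are illustrative assumptions and do not correspond to our actual estimation.

```python
# A minimal sketch, under stated assumptions, of fitting the log-linear
# Negative Binomial INGARCH(1, 1) by maximum likelihood with scipy, in the
# spirit of Fokianos and Tjostheim (2011). `counts`, the starting values and
# the (1, 1) order are illustrative; they are not the paper's configuration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def lambda_path(counts, a0, a1, b1):
    """Recursion for the log conditional mean lambda_t of the INGARCH(1, 1)."""
    lam = np.empty(len(counts))
    lam[0] = a0 / (1.0 - b1) if abs(1.0 - b1) > 1e-8 else a0
    for t in range(1, len(counts)):
        lam[t] = a0 + a1 * np.log(counts[t - 1] + 1.0) + b1 * lam[t - 1]
    return lam

def neg_loglik(theta, counts):
    a0, a1, b1, log_s = theta
    s = np.exp(log_s)                                          # dispersion s > 0
    lam = np.clip(lambda_path(counts, a0, a1, b1), -20, 20)    # guard against overflow
    mu = np.exp(lam)
    # nbinom(n=s, p=s/(s+mu)) has mean mu and variance mu + mu**2 / s, as in the text
    return -np.sum(nbinom.logpmf(counts, s, s / (s + mu)))

def fit_ingarch(counts):
    theta0 = np.array([0.1, 0.1, 0.5, 0.0])                    # (a0, a1, b1, log s)
    res = minimize(neg_loglik, theta0, args=(counts,), method="Nelder-Mead",
                   options={"maxiter": 5000})
    a0, a1, b1, _ = res.x
    return res, lambda_path(counts, a0, a1, b1)                # the latent series lambda_t

# Illustrative run on simulated over-dispersed counts (not the breach data)
rng = np.random.default_rng(3)
demo_counts = rng.negative_binomial(2, 0.3, size=500)
result, lam_hat = fit_ingarch(demo_counts)
```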

3.3 Empirical evidence

The standard cookbook of cointegration analysis requires a preliminary test to check the order of integration of the time series under investigation. According to the methodology explained in Sect. 2, to check whether or not each time series is integrated of order one, we perform the ADF-GLS test based on regression (1) for each variable and each first-order difference and report the results in Table 1. From Table 1, we conclude that the time series under investigation are integrated of order one. Indeed, the unit root tests performed on the levels of each variable lead to not rejecting the null hypothesis \(\phi = 0\), suggesting that each time series has a unit root, while the tests on the first-order differences lead to the rejection of the null hypothesis, meaning that stationarity is achieved after applying the first difference operator. In what follows, we discuss the results of the vector error-correction models for each of the two datasets.

Table 1 Unit root tests by means of ADF-GLS

3.3.1 PRC dataset

Having established that all the series involved in the analysis are I(1), here we determine whether there exists a cointegration relation between the variables. The Johansen's tests are based on r, the rank of the matrix \(\varvec{\varPi }\) in Eq. (3). The null hypothesis \(r = 0\) implies no cointegration, while \(r>0\) (\(r=1,\ldots ,d-1\)) means that there are r cointegrating relations. In the latter case, r distinct linear combinations of the variables—the cointegrating vectors—represent the long-run relation between the components of the multivariate time series. Table 2 presents the Johansen's tests on the PRC dataset, where we follow the Box–Jenkins model selection technique to select the optimal order of the VAR model (2), identified to be 8 according to the Bayesian information criterion (BIC). Both the trace test and the maximal eigenvalue test agree in supporting the hypothesis \(r = 2\).

Table 2 Johansen’s cointegration tests for the PRC dataset

Having identified \(r=2\) as the rank of \(\varvec{\varPi }\), we proceed by estimating the vector error-correction model with two cointegrating vectors. Table 3 reports the resulting vector error-correction model for the data breaches of the PRC dataset, the daily number of transactions of Bitcoin and the Bitcoin price, from which we identify both a short-run and a long-run relation between the lagged variables \(\varDelta C_{t}\) and \(\varDelta \lambda ^{\mathrm{PRC}}_{t}\), and a long-run relation between \(\varDelta P_t\) and \(\varDelta \lambda ^{\mathrm{PRC}}_{t}\).
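A minimal sketch of this estimation step with statsmodels' VECM class is given below; the column names, the deterministic specification and the simulated demo data are assumptions made for illustration only and do not reproduce the estimates of Table 3.

```python
# A minimal sketch of the VECM estimation step with statsmodels, assuming a
# hypothetical DataFrame `data` with columns ["C", "P", "lam"] holding the log
# number of Bitcoin transactions, the Bitcoin price and the extrapolated
# intensity. A VAR order of 8 in levels corresponds to k_ar_diff = 7; the
# deterministic specification and the demo data are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import VECM

def fit_vecm(data, rank=2, var_order=8):
    model = VECM(data[["C", "P", "lam"]], k_ar_diff=var_order - 1,
                 coint_rank=rank, deterministic="co")   # constant outside the cointegration relation
    res = model.fit()
    # res.alpha (loadings), res.beta (cointegrating vectors) and res.gamma
    # (short-run coefficients) are the estimated quantities discussed in the text
    return res

# Illustrative call on simulated data (not the paper's datasets)
rng = np.random.default_rng(4)
trend = np.cumsum(rng.normal(size=1500))
data = pd.DataFrame({"C": trend + rng.normal(size=1500),
                     "P": np.cumsum(rng.normal(size=1500)),
                     "lam": trend + rng.normal(size=1500)})
res = fit_vecm(data)
print(res.beta)   # estimated cointegrating vectors
```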

Table 3 Vector error-correction estimation. PRC dataset\(^{\mathrm{a,b}}\)

In Table 3, the underlined coefficients of the vector autoregression highlight the short-run relation between the lagged logarithm of the number of transactions in Bitcoin and the logarithm of the conditional expectations of data breaches. More precisely, almost all the lagged variables \(\varDelta C\) have a strong negative impact on the current conditional expectations of data breaches, as one may observe from the magnitude and the high significance of the regression coefficients. This suggests that the number of transactions in Bitcoin might be a good predictor for data breaches. The intuition behind this result is that hackers prepare themselves to monetize the data attack (either by selling the data or by extorting money from the legitimate data owner) some days in advance, by operating on the Bitcoin market.

Furthermore, when looking at the equation for the Bitcoin transactions in the short-run component of the estimated VECM model, we find that almost all lagged variables \(\varDelta \lambda ^{\mathrm{PRC}}\) are statistically significant in the regression for \(\varDelta C_t\). This result, coupled with the one regarding the equation for \(\varDelta \lambda _t^{\mathrm{PRC}}\), implies that there is a lead–lag relation between data breaches and transactions in Bitcoin, and that the relation is bidirectional. Hence, our results confirm that some of the movements in cryptocurrency markets depend on illegal activities (in our case, cyber attacks). Looking again at Table 3, we find no short-run link between lagged conditional expectations of data breaches and the Bitcoin price. Although the number of transactions has a strong impact on data breaches, this link is not necessarily reflected in a short-run impact on the Bitcoin price.

The long-run relation between the variables under investigation is described by the existence of two cointegrating variables, \(\varPsi ^{\mathrm{PRC},1}_t = P_t - 0.08233 \lambda ^{\mathrm{PRC}}_t\) and \(\varPsi ^{\mathrm{PRC},2}_t = C_t - 55.434 \lambda ^{\mathrm{PRC}}_t\), obtained by the estimation procedure detailed in Sect. 2.2. The double-underlined coefficient in Table 3 highlights that changes of \(\varDelta \lambda _t\) are affected with very high statistical significance by the second cointegrating variable. Since both the logarithm of the number of transactions and the logarithm of the conditional expectations of data breaches enter the second cointegrating vector, we conclude that the short-run impact found above is likely to persist in the long run. Our conjecture about the intuition behind this long-term link between transactions in Bitcoin and data breaches is as follows. On the one hand, once the hackers have monetized the breach, they hold an amount of Bitcoins that will later be used in different contexts. On the other hand, a remunerative data breach creates incentives to prepare more cyber attacks, which in turn create the need for more transactions in Bitcoin. We also find high statistical significance of both cointegrating variables in the equation for the change in the Bitcoin price. This implies that the effects of Bitcoin metrics and conditional expectations of data breaches will also impact Bitcoin returns in the long run. For instance, the negative and significant coefficient associated with \(\varPsi ^1\) indicates that if the linear combination of the Bitcoin price and the logarithm of the conditional expectations of data breaches is positive in one period, the price will fall during the next period to restore equilibrium, and vice versa.

Summarizing the empirical evidence discussed in this section, the change of the expected number of data breaches recorded in the PRC dataset is statistically (and negatively) influenced in the short run by its lagged values \(\varDelta \lambda ^{\mathrm{PRC}}_{t-i}\), \(i=1,\ldots ,3\),Footnote 11 and by the lagged values of the number of transactions in Bitcoin some days before the attack. In the long run, deviations from the cointegration relation, whose components are data breaches and transactions in Bitcoin, cause changes in the data breach intensities.

3.3.2 BLI dataset

The cointegration analysis of the BLI dataset fully confirms the existence of a strong, statistically significant link between data breaches and the number of transactions in Bitcoin, both in the short run and in the long run. Table 4 reports the Johansen's cointegration tests and highlights the existence of two cointegrating vectors. The optimal order for the VAR model is once again 8 and has been selected according to the Box–Jenkins technique. The strong significance of the underlined coefficients in Table 5 indicates the short-term relation. More specifically, the lagged variables \(\varDelta C_{t-1}\), \(\varDelta C_{t-2}\) and \(\varDelta C_{t-6}\) impact heavily on the current value of the logarithmic conditional expectations of data breaches. We also find statistical significance for the reverse relation, especially in the variables \(\varDelta \lambda ^{\mathrm{BLI}}_{t-6}\) and \(\varDelta \lambda ^{\mathrm{BLI}}_{t-7}\). This confirms the intuition provided in the previous section, according to which data breaches have a statistical effect on the number of transactions in Bitcoin. The analysis also confirms the lack of a significant short-term relation between data breaches and the price of Bitcoin.

Table 4 Johansen’s Cointegration Tests for BLI dataset
Table 5 Vector error-correction estimation. BLI dataset\(^{\mathrm{a,b}}\)

The two cointegrating variables \(\varPsi ^{\mathrm{BLI},1}_t = P_t -0.524 \lambda ^{\mathrm{BLI}}_t\) and \(\varPsi ^{\mathrm{BLI},2}_t = C_t - 354.95 \lambda ^{\mathrm{BLI}}_t\), obtained by inserting the estimated cointegrating vectors of Table 5 into the VECM equation (3), describe the long-term relation among the three variables under investigation. We see that changes in the logarithmic expectations of data breaches are due to changes in the lagged logarithms of the number of transactions, to the cointegrating variable itself (double-underlined coefficient in Table 5) and to the lagged logarithms of data breaches (bold coefficients in Table 5). This highlights once again the autoregressive structure of data breaches. As in the PRC dataset, the short-term relation between Bitcoin metrics and data breaches is reflected in a long-run impact on the Bitcoin rate of return.

3.4 Granger causality tests

In this section, we perform a series of Granger causality tests (Granger 1969, 1988; Sims 1972) to provide further evidence on the lead–lag relationships between data breaches and Bitcoin metrics. The Granger causality test is a standard tool for uncovering lead–lag relationships among economic variables. With reference to the most successful applications, we mention Chan (1992) and Abhyankar (1998), among others. However, it is worth mentioning that novel methodologies for determining time-dependent lead–lag relations based on optimal thermal paths have appeared recently, as in Meng et al. (2017) and Xu et al. (2017).

Table 6 Granger causality tests for bivariate VARs
Table 7 Granger causality tests for trivariate VARs

We use first differences of the variables \(\left( C_t,P_t,\lambda ^h_t\right) '\) and perform the tests on both bivariate and trivariate VAR models. In the case of bivariate models, testing for instance the null that \(\varDelta C_t\) does not Granger-cause \(\varDelta \lambda _t\) amounts to estimating the VAR comprising the two variables and testing the null that the coefficients associated with the first variable are all zero in the equation for \(\varDelta \lambda _t\), against the alternative that at least one is different from zero. In the case of the trivariate model (like the one in Eq. (3)), we perform the same test, but the VAR model also includes the remaining variable. We choose the order of autoregression according to the Bayesian information criterion. We report the results in Tables 6 and 7. Underlined p values indicate rejection of the null hypothesis and thus Granger causality between the variables under investigation. We observe that the two tables agree in all the cases under consideration but one. Moreover, the results are fully consistent with the VECM models previously considered.
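The sketch below illustrates both variants with statsmodels: grangercausalitytests for the bivariate case and the test_causality method of a fitted VAR for the trivariate case. Column names, the fixed lag order and the white-noise demo data are illustrative assumptions.

```python
# A minimal sketch of the bivariate and trivariate Granger causality tests with
# statsmodels. `d` is a hypothetical DataFrame of first differences with columns
# ["dC", "dP", "dlam"]; the fixed lag order 8 stands in for the BIC-selected one.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.api import VAR

def granger_bivariate(d, caused="dlam", causing="dC", maxlag=8):
    # grangercausalitytests checks whether the SECOND column Granger-causes the first
    res = grangercausalitytests(d[[caused, causing]], maxlag=maxlag, verbose=False)
    return {lag: out[0]["ssr_ftest"][1] for lag, out in res.items()}   # p-value per lag

def granger_trivariate(d, caused="dlam", causing="dC", lags=8):
    var_res = VAR(d[["dC", "dP", "dlam"]]).fit(lags)
    return var_res.test_causality(caused=caused, causing=causing, kind="f").pvalue

# Illustrative call on white noise, where no Granger causality is expected
rng = np.random.default_rng(5)
d = pd.DataFrame(rng.normal(size=(1500, 3)), columns=["dC", "dP", "dlam"])
print(granger_bivariate(d))
print(granger_trivariate(d))
```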

3.5 Impulse response analysis

Having discovered a causal link between Bitcoin metrics and the conditional expectations of data breaches recorded in two different databases, we analyze the impulse response function (IRF) of the estimated VECM to evaluate the response of the conditional expectations of data breaches to unexpected shocks of the Bitcoin-related variables. The possibility of studying impulse responses in the context of cointegration analysis, and the feasibility of combining the two approaches, has been demonstrated by Lütkepohl and Reimers (1992).

Fig. 3 Impulse response point estimates and 95% confidence bands. a: Shock variable \(P_t\), response variable \(\lambda ^{\mathrm{PRC}}_t\). b: Shock variable \(C_t\), response variable \(\lambda ^{\mathrm{PRC}}_t\). c: Shock variable \(P_t\), response variable \(\lambda ^{\mathrm{BLI}}_t\). d: Shock variable \(C_t\), response variable \(\lambda ^{\mathrm{BLI}}_t\)

In what follows, the IRF of variable i to shock j is defined as the sequence of the elements in the ith row and jth column of the matrices \(\{\varvec{\varPhi }_k\}_{k=0,1,\ldots }\). Assuming that the error term in (2)–(3) can be written as a linear combination of mutually uncorrelated shocks with unit variance, i.e., \(\varvec{\varepsilon }_t=\varvec{H}\varvec{u}_t\), these matrices are obtained as \(\varvec{\varPhi }_k=\frac{\partial \varvec{Y}_t}{\partial \varvec{u}_{t-k}} =\frac{\partial \varvec{Y}_t}{\partial \varvec{\varepsilon }_{t-k}}\varvec{H}\). The matrix \(\varvec{H}\) is assumed to be lower triangular, and its estimate is obtained as the Cholesky decomposition of the estimated variance–covariance matrix of \(\varvec{\varepsilon }_t\) (see Lütkepohl 2006, Chapter 9). Choosing \(\varvec{H}\) to be lower triangular implies that, in general, the ordering of the variables in the vector \(\varvec{Y_t}\) matters. Since we are interested in the effects of Bitcoin-related variables on data breaches, we put the latter as the last variable in the ordering, so that in this setting the variable \(\varDelta \lambda _t\) responds instantaneously to shocks associated with the remaining two variables. However, we have verified that, in our case, a different ordering does not substantially change the estimated impulse response functions.
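The sketch below reconstructs these orthogonalized responses directly from the estimated VECM matrices (loadings, cointegrating vectors, short-run coefficients and residual covariance), assuming `res` is the VECMResults object from the earlier sketch; it is an illustrative implementation of the formulas above, not our production code.

```python
# An illustrative reconstruction of the orthogonalized impulse responses from
# an estimated VECM, assuming `res` is the statsmodels VECMResults object of the
# earlier sketch (attributes alpha, beta, gamma, sigma_u). It implements the
# formulas above with Cholesky identification; it is a sketch, not the authors' code.
import numpy as np

def vecm_to_var(res):
    """Level-VAR coefficient matrices A_1, ..., A_k implied by the VECM."""
    d = res.alpha.shape[0]
    Pi = res.alpha @ res.beta.T
    Gammas = [res.gamma[:, i * d:(i + 1) * d] for i in range(res.gamma.shape[1] // d)]
    k = len(Gammas) + 1
    A = [np.zeros((d, d)) for _ in range(k)]
    A[0] = np.eye(d) + Pi + (Gammas[0] if Gammas else np.zeros((d, d)))
    for i in range(1, k - 1):
        A[i] = Gammas[i] - Gammas[i - 1]
    if Gammas:
        A[k - 1] = -Gammas[-1]
    return A

def orthogonalized_irf(res, horizon=120):
    """List of matrices Phi_0 H, ..., Phi_horizon H, with H = chol(sigma_u)."""
    A = vecm_to_var(res)
    d = A[0].shape[0]
    H = np.linalg.cholesky(res.sigma_u)
    Phi = [np.eye(d)]
    for h in range(1, horizon + 1):
        Phi.append(sum(Phi[h - j] @ A[j - 1] for j in range(1, min(h, len(A)) + 1)))
    return [P @ H for P in Phi]

# Example (with `res` from the VECM sketch): response of the third variable,
# the breach intensity in our ordering, to a shock in the first after 10 days
# irf = orthogonalized_irf(res); print(irf[10][2, 0])
```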

Figure 3 reports point estimates and 95% confidence intervals of the response of \(\lambda ^h\), \(h = \text {PRC, BLI}\), to exogenous one-standard-deviation shocks of the Bitcoin price and of the number of transactions, over a period of 120 days. The impact of the \(C_t\) variable seems stronger in the short run (up to about two weeks), especially for the BLI dataset. For both datasets, the impulse response estimates associated with the \(C_t\) variable cross the zero axis more often than those associated with the Bitcoin price. More specifically, Panels (a) and (b) refer to changes of the PRC data breaches, while Panels (c) and (d) refer to changes of the BLI data breaches due to shocks of the Bitcoin-related variables. We note that the logarithm of the number of transactions in Bitcoin has a relevant impact on future data breaches in both datasets. The size of the response is significantly different from zero, and the phenomenon persists in the long run.

The confidence intervals associated with the point estimates are very tight in the BLI dataset, indicating low variability in the estimates, and less tight in the PRC dataset. The response to unexpected shocks of the Bitcoin price is less relevant, at least up to 15 days. Since after about 15 days the confidence interval lies above the zero line, the impact of a shock of the Bitcoin price on data breaches becomes positive and significant in the case of the PRC dataset. On the other hand, a shock of the Bitcoin price has a negative and significant impact after about 15 days on the data breaches from the BLI dataset. The effect of a shock of the Bitcoin price starts declining after 10 days and almost vanishes by the 30th day in both datasets.

3.6 Variance decomposition of forecasting errors

In this section, we offer further details on the contribution of each Bitcoin-related variable to the forecast power of the estimated VAR model.Footnote 12

Table 8 Variance decomposition of forecast errors of data breaches intensities

The variance of forecast error after h steps for variable i, \(\omega _{h;i}^2\), is defined as the element in position i in the main diagonal of the matrix \(\sum _{k=0}^h \varvec{\varPhi }_k\varvec{\varPhi }_k'\). The contribution of variable j to the variance of forecast error after h steps for variable i is calculated as \(\mathrm {VDFE}_{h;i,j}=\frac{\sum _{k=0}^h(\phi _{k;i,j})^2}{\omega _{h;i}^2}\), where \(\phi _{k;i,j}\) is the element in the ith row and jth column of matrix \(\varvec{\varPhi }_k\).
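This quantity can be computed directly from the orthogonalized IRF matrices; the short sketch below implements the formula, reusing the (hypothetical) list of matrices produced by the impulse response sketch in Sect. 3.5.

```python
# A short sketch implementing the decomposition above on the list of
# orthogonalized IRF matrices produced by the previous sketch (irf[k] playing
# the role of Phi_k); names and indices are illustrative assumptions.
import numpy as np

def vdfe(Phi, i, j, h):
    """Share of the h-step forecast error variance of variable i due to shock j."""
    omega2 = sum(P @ P.T for P in Phi[:h + 1])[i, i]   # total h-step error variance of variable i
    contrib = sum(P[i, j] ** 2 for P in Phi[:h + 1])   # part attributable to shock j
    return contrib / omega2

# Example: contribution of the Bitcoin-price shock (column 1) to the 30-day
# forecast error variance of the breach intensity (row 2), given `res` as above
# print(vdfe(orthogonalized_irf(res, horizon=30), i=2, j=1, h=30))
```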

Table 8 presents the variance decomposition of forecasting errors (VDFE) associated with the VECM estimations presented in Tables 3 and 5. VDFE is a classical tool used by econometricians to understand the impact of exogenous shocks of a given (independent) variable on the forecasting errors of a different (dependent) variable. In other words, VDFE helps us understand the contribution of Bitcoin-related variables in the forecasting power of our model for data breaches.

Looking at the results presented in Table 8, we see that, in the PRC dataset, the contribution of the number of transactions in Bitcoin in explaining the variance of the forecasting errors of data breaches is irrelevant when the forecasting horizon is short (up to 5 days). However, for forecasting horizons greater than 5 days, an exogenous shock in this variable is able to explain up to 4% of the variance of the forecasting errors. The contribution of the Bitcoin price seems to be even more important, as it can explain up to 7.3% of the total variance. In total, for the PRC dataset, Bitcoin metrics are able to explain more than 11.3% of the total variance of the forecasting errors. The VDFE of the BLI dataset displays a reduced relevance of Bitcoin metrics. The number of transactions in Bitcoin is able to explain at most 1.1% of the entire variance, and the Bitcoin price only up to 0.47%.

4 Concluding remarks

In this paper, we uncover a strong, bidirectional relation between data breaches and Bitcoin-related variables. Our analysis suggests that, in the short run, the lagged values of the number of transactions in Bitcoin have a strong negative impact on data breaches today. In the long run, the existence of a cointegrating vector including all variables under investigation implies that the short-run relation will persist. Moreover, we find almost identical results on two different datasets, confirming the robustness of our results. The impulse response analyses highlight the relevant quantitative impact of both the number of transactions and the Bitcoin price on future data breaches, while the variance decomposition of the forecasting errors suggests that these variables can explain up to 5% of the variability.

Our results might open up new research directions. First, on the econometric side, a deeper understanding of the relation between cyber risk and cryptocurrencies might be helpful. Indeed, one might wonder whether the relations found in this paper extend to other classes of cyber risk and to different cryptocurrencies. With the availability of new datasets of cyber attacks, such analyses are going to become feasible in the near future. Second, on the actuarial side, the next step is to create a risk model for data breaches which includes exogenous factors such as cryptocurrency-related variables. This might improve the forecasting procedures and give insurers a better understanding of cyber risk. An analysis of the impact of our findings on classical actuarial measures might also be of great interest. Finally, the relevance of cryptocurrencies in an international financial context is increasing. In this scenario, understanding the connections among cryptocurrencies, macroeconomic variables and other factors such as data breaches is surely an interesting point to explore further.