1 Introduction

Concerns about cybersecurity threats are growing across all sectors of the global economy, as cyber risks have increased and cyber criminals have become progressively more dangerous.

In recent years, many stealthy and sophisticated cyber attacks have targeted public and private sector organizations. The annual cost of cybercrime study conducted by the Ponemon InstituteFootnote 1 confirms that, combined with the expanding threat landscape, organizations are noticing a steady rise in the number of security breaches: The average number has moved from 130 in 2017 to 145 in 2018 (+11% over the last year, +67% over the last 5 years). The impact of these cyber attacks on organizations, industries and society is substantial, as the total cost of cybercrime for each company has increased from $11.7 million in 2017 to a new high of $13.0 million in 2018 (+12% over the last year, +72% over the last 5 years). The 2018 study reports that the global average cost of a data breach is up 6.4% over the previous year to $3.86 million. The average cost for each lost or stolen record containing sensitive and confidential information has also increased by 4.8% year over year to $148. According to the Online Trust Alliance,Footnote 2 the number of cyber attacks worldwide doubled in 2017 to 160,000, although endemic underreporting means that the true figure could be as high as 350,000.

Despite the improvements in security countermeasures and practices, the statistics presented above highlight how cyber insurance represents an important tool for risk managers to mitigate the economic impact of cyber attacks. The demand for cyber insurance is expected to grow strongly as people and companies become aware of the economic risk behind cyber attacks. However, the market for cyber insurance is undersized, mainly because insurance and reinsurance companies are still unprepared to offer coverage for such risks. As KPMG highlights in one of its reportsFootnote 3 on cyber insurance, insurers still need to improve their modeling capabilities with respect to these specific types of risk.

With the availability of new databases, academic research has started offering its contribution to understanding a particular class of cyber risk, namely data breaches. The literature on cyber risk and information security is rich in papers from the information technology area, while less work has been proposed in economics, finance and insurance. A comprehensive reference for an overview of the latter is Xu et al. (2018). In that study, the authors discuss a statistical analysis of a breach incident dataset obtained from the Privacy Rights ClearinghouseFootnote 4 and use stochastic processes to fit and predict inter-arrival times and breach sizes. The work includes a detailed review of prior contributions on the topic: Among others, it is worth mentioning Eling and Loperfido (2017), which analyzes the PRC Database with some actuarial insights, and the related studies on data breach statistical evaluations such as Maillart and Sornette (2010); Edwards et al. (2016); Wheatley et al. (2016, 2019, 2020). The PRC Database is also used in Farkas et al. (2019): The authors investigate the heterogeneity of the reported cyber claims using regression trees. The economic value of cyber risk is discussed in Eling and Wirfs (2019), where the authors focus on cyber losses from an operational risk database and analyze the dataset with methods from statistics and actuarial science. As far as cyber insurance is concerned, a review of the available scientific approaches for the analysis of the cyber insurance market is Marotta et al. (2017), where the authors offer insights from both market and scientific perspectives.

In this paper, we go a step further in the understanding of data breaches by providing a dynamic analysis in which we find a causal relation between the intensity of data stolen and some metrics of the cryptocurrency market. Our original conjecture is as follows: If hackers perform data attacks to make a profit, they must somehow monetize the attack, and some cryptocurrencies offer the quickest and most anonymous way to do so. To test our postulate, we perform, on two distinct datasets of data breaches, a rigorous cointegration analysis between the daily number of stolen data, the daily Bitcoin price and the daily number of transactions in Bitcoin. In addition, we run Granger causality tests between the three variables. In both datasets, we find strong empirical evidence of a causal relationship between the number of data breaches and the Bitcoin-related variables, both in the short run and in the long run.

To the best of our knowledge, we provide for the first time a set of easily measurable variables that explain data breaches. Thus, our findings offer new insights into the statistical estimation and forecasts of data breaches. This might guide insurers and reinsurers in the process of building new products that offer protection against such kinds of risk.

In the remainder of the paper, we proceed as follows: In Sect. 2, we summarize the methodology used. In Sect. 3, we describe the datasets used and present the results of the cointegration analysis and Granger causality tests. To quantify the impact of Bitcoin-related variables on data breaches, we also perform an impulse response analysis and a variance decomposition of the forecasting errors. In Sect. 4, we conclude by highlighting our results and suggesting new directions of research.

2 Methodology: cointegration analysis

Cointegration analysis has been widely used in finance and economics. Among other applications, it has been employed to investigate the lead–lag relationship between spot and futures prices (see for instance Tse 1995) and the integration and efficiency of international bond markets (Mills and Mills 1991; Ciner 2007). As for applications involving economic data, many authors have relied on cointegration analysis to test the purchasing power parity (Pippenger 1993; Chen 1995) or to examine the expectations theory of the term structure of interest rates (see Campbell and Shiller 1987; Shea 1992, among others).

A d-dimensional time series \(\mathbf {Y}_t\) is said to be cointegrated of order (a, b) if each series is integrated of order a,Footnote 5 i.e., each series becomes stationary after taking first differences a times, and there exists a linear combination of the d variables, \(\tilde{\mathbf{Y }}_t=\varvec{\beta }'\mathbf {Y}_t\) with \(\varvec{\beta }\) a nonzero \(d \times 1\) vector, such that \(\tilde{\mathbf{Y }}_t\) is integrated of order \(a-b\). As with many economic time series, we are interested in the case in which \(a=b=1\), meaning that each of the one-dimensional components of \(\varvec{Y}_t\) has a unit root (i.e., it is integrated of order one), while their linear combination \(\tilde{\mathbf{Y }}_t\) is instead I(0). Hence, the starting point of cointegration analysis consists in establishing the order of integration of all the series of interest.
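To make the definition concrete, the following minimal sketch (illustrative only, not part of our empirical analysis) simulates two I(1) series sharing a common stochastic trend: each series has a unit root, while the linear combination obtained with \(\varvec{\beta }=(1,-1)'\) is stationary, so the pair is cointegrated of order (1, 1). Library and variable names are our own choices.

```python
# Illustrative simulation (not from the paper): two I(1) series sharing a
# common stochastic trend, so that beta = (1, -1)' yields a stationary
# linear combination, i.e., the pair is cointegrated of order (1, 1).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(size=2000))      # common random walk
y1 = trend + rng.normal(size=2000)            # each component is I(1)
y2 = trend + rng.normal(size=2000)

print(adfuller(y1)[1], adfuller(y2)[1])       # large p-values: unit root not rejected
print(adfuller(y1 - y2)[1])                   # small p-value: the combination is I(0)
```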

2.1 Unit root tests

The first step needed in a cointegration analysis involves testing if all the time series in \(\mathbf {Y}_t\) are integrated of order one. To this end, besides the usual augmented Dickey–Fuller (ADF) tests, we consider the ADF-GLS test of Elliott et al. (1996). The authors proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term: These are estimated first via generalized least squares, and in a second stage an ADF regression is performed using the GLS residuals. The usual ADF tests are based on the t-statistic on \(\phi \) in the following regression:

$$\begin{aligned} \varDelta y_{i,t} = \mu _t + \phi y_{i,t-1} + \sum _{j=1}^p \gamma _j\varDelta y_{i,t-j} + \epsilon _{i,t}, \end{aligned}$$
(1)

where \(y_{i,t}\) is the ith component of \(\mathbf {Y}_t\). The null hypothesis of a unit root is \(\phi =0\), tested against the alternative \(\phi <0\). Therefore, large negative values of the test statistic lead to the rejection of the null. If all the components of \(\mathbf {Y}_t\) are found to be I(1), then the econometrician can move to the next step, i.e., the Johansen procedure. Its aim is to establish whether \(\mathbf {Y}_t\) is cointegrated and, if this is the case, how many cointegrating relations exist.
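As a hedged illustration of this first step, the sketch below runs both the standard ADF test (statsmodels) and the ADF-GLS test of Elliott et al. (1996), as implemented in the arch package, on the levels and first differences of a series; the input series, trend specification and lag-selection criterion are illustrative assumptions, not our actual configuration.

```python
# A hedged sketch of the unit-root step: standard ADF (statsmodels) and
# ADF-GLS (arch) on the levels and first differences of a series. The input
# series, trend specification and lag selection are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from arch.unitroot import DFGLS

def unit_root_report(series: pd.Series, name: str) -> None:
    """Run ADF and ADF-GLS on the levels and on the first difference."""
    for label, y in [("level", series.dropna()),
                     ("first difference", series.diff().dropna())]:
        adf_stat, adf_pval, *_ = adfuller(y, regression="c", autolag="BIC")
        dfgls = DFGLS(y, trend="c")   # Elliott-Rothenberg-Stock ADF-GLS test
        print(f"{name} ({label}): ADF = {adf_stat:.3f} (p = {adf_pval:.3f}), "
              f"ADF-GLS = {dfgls.stat:.3f} (p = {dfgls.pvalue:.3f})")

# Illustrative call on a simulated random walk, which is I(1) by construction
rng = np.random.default_rng(1)
unit_root_report(pd.Series(np.cumsum(rng.normal(size=1500))), "random walk")
```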

2.2 The Johansen procedure

A general vector autoregression (VAR) model with deterministic part \(\varvec{\mu }_t\) of the form:

$$\begin{aligned} \mathbf {Y}_t = \varvec{\mu }_t+\varvec{\varPi }_1 \mathbf {Y}_{t-1} + \cdots +\varvec{\varPi }_k \mathbf {Y}_{t-k} + \varvec{\varepsilon }_t , \quad t = 1, \ldots , T, \end{aligned}$$
(2)

can be rewritten using the following vector error-correction model (VECM) specificationFootnote 6:

$$\begin{aligned} \varDelta \mathbf {Y}_t = \varvec{\mu }_t+ \varvec{\varPi } \mathbf {Y}_{t-1}+ \varvec{\varGamma }_1 \varDelta \mathbf {Y}_{t-1} + \cdots + \varvec{\varGamma }_{k-1} \varDelta \mathbf {Y}_{t-k+1} + \varvec{\varepsilon }_t \end{aligned}$$
(3)

where

$$\begin{aligned} \varvec{\varGamma }_i= & {} - (\varvec{\varPi }_{i+1} + \cdots + \varvec{\varPi }_k), \quad i = 1, \ldots , k-1,\\ \varvec{\varPi }= & {} -(\mathbf {I} - \varvec{\varPi }_1 - \cdots - \varvec{\varPi }_k), \end{aligned}$$

and \(\varDelta \) is the first difference operator, i.e., \(\varDelta \mathbf {Y}_t = \mathbf {Y}_t-\mathbf {Y}_{t-1}\).

We implement the tests developed by Johansen (1991) to test the hypothesis that \(\mathbf {Y}_t\) is cointegrated of order (1, 1). Such a hypothesis involves r, the rank of \(\varvec{\varPi }\). If \(1\le r\le d-1\), one can write \(\varvec{\varPi }=\varvec{\alpha }\varvec{\beta }'\) where \(\varvec{\alpha }\) and \(\varvec{\beta }\) are \(d \times r\) matrices. The matrix \(\varvec{\beta }\) contains r linear cointegration parameter vectors, whereas \(\varvec{\alpha }\) is a matrix of error-correction parameters (the so-called loadings). The maximum likelihood estimation proceeds as follows. First, one runs the OLS regression of \(\varDelta \mathbf {Y}_t\) on \(\varDelta \mathbf {Y}_{t-1}\), \(\ldots \), \(\varDelta \mathbf {Y}_{t-k+1}\) and a constant, and denotes by \(\varvec{\hat{\epsilon }}_{0t}\) the residuals. Similarly, one runs the OLS regression of \(\mathbf {Y}_{t-1}\) on \(\varDelta \mathbf {Y}_{t-1}\), \(\ldots \), \(\varDelta \mathbf {Y}_{t-k+1}\) and a constant, and denotes by \(\varvec{\hat{\epsilon }}_{1t}\) the residuals. Given the residuals, it is possible to calculate for \(i,j=0,1\) the matrices \(\varvec{S}_{ij}=T^{-1}\sum _{t=1}^T \varvec{\hat{\epsilon }}_{it} \varvec{\hat{\epsilon }}_{jt}'\). Let \(\hat{\lambda }_1>\cdots >\hat{\lambda }_d\) be the eigenvalues obtained from solving the eigenvalue system \(\left| \lambda \varvec{S}_{11}-\varvec{S}_{10} \varvec{S}_{00}^{-1}\varvec{S}_{01}\right| =0\), and \((\varvec{\hat{\psi }}_{1},\ldots ,\varvec{\hat{\psi }}_{d})\) the corresponding eigenvectors. The estimate of \(\varvec{\beta }\), \(\hat{\varvec{\beta }}\), is obtained as the juxtaposition of the eigenvectors associated with the r largest eigenvalues, and the estimate of \(\varvec{\alpha }\) is obtained as \(\hat{\varvec{\alpha }}=\varvec{S}_{01}\hat{\varvec{\beta }}\). Two maximum likelihood tests proposed by Johansen, the maximal eigenvalue test and the trace test, can then be used to determine the number of cointegration vectors. The statistic of the maximal eigenvalue test for the null hypothesis of r cointegration vectors against the alternative of \(r+1\) cointegration vectors is \(\hat{\lambda }_{\text {max}}=-T\log (1-\hat{\lambda }_{r+1})\). The trace test statistic for the null hypothesis of at most r cointegration vectors is \(\hat{\lambda }_{\text {trace}}=-T\sum _{i=r+1}^d\log (1-\hat{\lambda }_{i})\). If the results are consistent with the hypothesis of at least one cointegration vector, one then uses the maximum likelihood method to test hypotheses regarding the restrictions on \(\varvec{\beta }\).
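The sketch below shows how the trace and maximal eigenvalue tests can be computed with statsmodels' coint_johansen; the deterministic specification, the lag order and the simulated input are assumptions made for illustration and do not reproduce our estimates.

```python
# A minimal sketch of the Johansen trace and maximal-eigenvalue tests using
# statsmodels' coint_johansen. The deterministic specification (det_order=0,
# a constant), the lag order and the simulated input are assumptions for
# illustration only.
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_tests(Y, k=8):
    # k_ar_diff = k - 1 lagged differences corresponds to a VAR(k) in levels.
    res = coint_johansen(Y, det_order=0, k_ar_diff=k - 1)
    for r in range(Y.shape[1]):
        # trace tests H0: rank <= r; max-eigenvalue tests H0: rank = r vs r + 1
        print(f"r = {r}: trace = {res.lr1[r]:.2f} (95% cv {res.cvt[r, 1]:.2f}), "
              f"max-eig = {res.lr2[r]:.2f} (95% cv {res.cvm[r, 1]:.2f})")
    return res

# Illustrative call on simulated data in which the first two series share a trend
rng = np.random.default_rng(2)
trend = np.cumsum(rng.normal(size=2000))
Y = np.column_stack([trend + rng.normal(size=2000),
                     trend + rng.normal(size=2000),
                     np.cumsum(rng.normal(size=2000))])
johansen_tests(Y)
```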

3 The relation between data breaches and Bitcoin metrics

3.1 Data

In this paper, we look for short-term and/or long-term relations between data breaches and Bitcoin-related variables. More specifically, we perform two distinct analyses based on two publicly available databases of data breaches.

The first database is taken from the Chronology of Data Breaches provided by the Privacy Rights ClearinghouseFootnote 7 (PRC). The PRC dataset is publicly available and constantly updated on the PRC Web site and has been used in other recent investigations (see for instance, Eling and Loperfido 2017; Maillart and Sornette 2010; Edwards et al. 2016; Wheatley et al. 2016; Farkas et al. 2019; Wheatley et al. 2019, 2020).

The second dataset is obtained from the Breach Level Index (BLI) Data Breach Database, a centralized, global database of data breaches with calculations of their severity based on multiple factors. The BLI tracks publicly disclosed breaches and also allows organizations to perform their own risk assessment: On the basis of a few simple inputs, it calculates their risk score and overall breach severity level, and summarizes possible actions to reduce the risk score. The dataset has been downloaded from the Web site of Gemalto, part of the Thales Group, one of the world leaders in digital security.Footnote 8

We note that the databases do not necessarily contain all hacking breach events, because there may be unreported ones. The exposure to this type of risk is not easy to track, since the population of potential victims that would report to such registers is not stable through time, or at least not known, in contrast to the information that comes from an insurer, which might have a clearer view of the exposure. Many organizations are not aware that they have been breached, or they are not required to report it under the applicable reporting laws. The PRC Chronology is limited to breaches reported in the USA. If a data breach affects individuals in other countries, it is included only if individuals in the USA are also affected. The data contain only the number of records affected by data breaches and do not include financial losses.

The PRC database is organized by industry and type of attack. The BLI database also includes the country affected by the breach.

As for the Bitcoin-related variables, we are mainly interested in the daily price and the daily number of transactions of Bitcoin. Our data source is DataHub,Footnote 9 a project by Datopian and Open Knowledge International that provides publicly available high-quality datasets. For the cryptocurrency, we focus on the historical prices (USD) and on the number of transactions occurring on the public blockchain during a given day.

The period of time we refer to, chosen considering the dimensions of the datasets and in order to make them comparable, goes from January 1, 2013, to December 31, 2018. The work aims at investigating the idea that some breaches are exploited by criminal organizations to make money: For this reason, the analysis focuses on malicious breaches related to hackers, while negligent breaches and the other sub-categories of malicious breaches included in the databases (i.e., insider, payment card fraud, ransomware, accidental unknown) have been ignored.

When researchers apply cointegration, both the sample size and the time span are relevant. Hakkio and Rush (1991) argue that cointegration is a long-run concept, and hence long spans of data are needed for cointegration tests to have power, and that gains from using more frequently sampled observations while keeping the same time span are “more apparent than real.” The Monte Carlo study of Zhou (2001), while confirming the importance of the time span, reveals that increasing the data frequency may yield substantial power gains. Since the considered time series on data breaches go back only to 2013 and are available at most at a daily frequency, in the present paper we use the longest span at the highest possible frequency. We believe that the time span is long enough to obtain reliable results regarding the short-term and long-term relations between the variables of interest.

The dynamic behavior of data breaches is represented by integer-valued time series displaying an unusual pattern which resembles a point process. Figure 1 plots the two time series generated from the datasets used. The figure clearly shows the impulsive nature intrinsic to time series of this kind. In such cases, standard cointegration analysis cannot be applied directly to the time series.Footnote 10 Our empirical strategy to overcome this problem is to extrapolate from the original dataset a new latent time series that generated the observed pattern of data breaches. Then, we perform the cointegration analysis using the extrapolated latent time series instead of the observed data. This approach is quite common in cointegration analysis. We refer to Niu and Melenberg (2014) for a cointegration analysis that uses latent factors.

We report additional details in the subsequent subsection.

3.2 Time series of counts and their intensities

Integer-valued GARCH models (henceforth INGARCH) constitute a popular class of models for time series of counts. Although the name might suggest some sort of affinity with the well-known GARCH models, INGARCHs are autoregressive moving average processes constructed to model the dynamics of phenomena that are discrete. An INGARCH model allows the conditional expected value of a discrete random variable (or some transformation of it) that models a countable phenomenon to depend on its previous values and on previous observations of the phenomenon itself. As a general discussion of such processes is far beyond the scope of the present paper, we restrict ourselves to the presentation of the model used to extract the latent time series and refer to Weiß (2018) for an outstanding introduction to INGARCH models.

Fig. 1 Plot of the observed time series. Left panel: PRC dataset. Right panel: BLI dataset. All plots are in logarithmic scale for visualization purposes

Let \(B_t\) be the observed breach size at time t. We assume that the conditional distribution \(B_t|(B_0,\ldots ,B_{t-1})\) of the observed breach size given the previous realizations follows a Negative Binomial distribution \(NB\left( s,p_t = \frac{\mu _t}{s + \mu _t}\right) \) with probability mass

$$\begin{aligned} P(B_t = k) = \frac{\varGamma (s+k)}{\varGamma (k+1)\varGamma (s)} \left( \frac{\mu _t}{s + \mu _t}\right) ^k \left( \frac{s}{s + \mu _t}\right) ^s, \end{aligned}$$

so that we have \(E\left( B_t|(B_0,\ldots ,B_{t-1})\right) =\mu _t\) and \(Var\left( B_t|(B_0,\ldots ,B_{t-1})\right) = \mu _t + \frac{\mu _t^2}{s}\). In addition, we allow \(\lambda _t = \log (\mu _t)\) to follow an ARMA process of order (p, q) of the following type:

$$\begin{aligned} \lambda _t = a_0 + \sum _{i=1}^p a_i \log \left( B_{t-i}+1\right) + \sum _{j=1}^q b_j \lambda _{t-j}. \end{aligned}$$

This model first appeared in Fokianos and Tjøstheim (2011). It is a generalization of the basic INGARCH model that allows for both positive and negative serial correlation. The choice of a logarithmic scale for the observed time series is needed to ensure the positivity of the conditional expectation \(\mu _t\). Fokianos and Tjøstheim (2011) also show that adding a constant to the logarithmic transformation of the time series does not alter the estimation process. Although the model was originally proposed in association with a Poisson distribution for the observed time series, the strong over-dispersion present in the breach data motivates our choice of a negative binomial distribution.

Fig. 2 Estimated logarithmic conditional expectations of the INGARCH model. Left panel: PRC dataset. Right panel: BLI dataset

We use maximum likelihood to fit the INGARCH model, with the main goal of extrapolating the time series of (logarithmic) conditional expected values \(\lambda _t\). We plot the resulting extrapolated time series in Fig. 2 and observe that the dynamic patterns of the latent time series are well suited for standard cointegration analysis. Thus, in what follows we investigate cointegration using \(\lambda _t\) instead of the original dataset. Accordingly, we specify the vector in (2) and (3) as \(\mathbf {Y}_t = \left( C_t,P_t,\lambda ^h_t\right) '\), where \(C_t\) and \(P_t\) are, respectively, the logarithm of the daily number of transactions in Bitcoin and the daily Bitcoin price, and \(\lambda ^h_t\) is the logarithmic conditional expectation estimated from dataset \(h=\text {PRC, BLI}\).
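A minimal sketch of this fitting step is given below, assuming an INGARCH(1, 1) specification and using scipy for the maximum likelihood optimization; the input series, the starting values and the (1, 1) order are illustrative assumptions and do not correspond to our actual estimation.

```python
# A minimal sketch, under stated assumptions, of fitting the log-linear
# Negative Binomial INGARCH(1, 1) by maximum likelihood with scipy, in the
# spirit of Fokianos and Tjostheim (2011). `counts`, the starting values and
# the (1, 1) order are illustrative; they are not the paper's configuration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def lambda_path(counts, a0, a1, b1):
    """Recursion for the log conditional mean lambda_t of the INGARCH(1, 1)."""
    lam = np.empty(len(counts))
    lam[0] = a0 / (1.0 - b1) if abs(1.0 - b1) > 1e-8 else a0
    for t in range(1, len(counts)):
        lam[t] = a0 + a1 * np.log(counts[t - 1] + 1.0) + b1 * lam[t - 1]
    return lam

def neg_loglik(theta, counts):
    a0, a1, b1, log_s = theta
    s = np.exp(log_s)                                          # dispersion s > 0
    lam = np.clip(lambda_path(counts, a0, a1, b1), -20, 20)    # guard against overflow
    mu = np.exp(lam)
    # nbinom(n=s, p=s/(s+mu)) has mean mu and variance mu + mu**2 / s, as in the text
    return -np.sum(nbinom.logpmf(counts, s, s / (s + mu)))

def fit_ingarch(counts):
    theta0 = np.array([0.1, 0.1, 0.5, 0.0])                    # (a0, a1, b1, log s)
    res = minimize(neg_loglik, theta0, args=(counts,), method="Nelder-Mead",
                   options={"maxiter": 5000})
    a0, a1, b1, _ = res.x
    return res, lambda_path(counts, a0, a1, b1)                # the latent series lambda_t

# Illustrative run on simulated over-dispersed counts (not the breach data)
rng = np.random.default_rng(3)
demo_counts = rng.negative_binomial(2, 0.3, size=500)
result, lam_hat = fit_ingarch(demo_counts)
```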

3.3 Empirical evidence

The standard cookbook of cointegration analysis requires a preliminary test to check the order of integration of the time series under investigation. According to the methodology explained in Sect. 2, to check whether or not each time series is integrated of order one, we perform the ADF-GLS test based on regression (1) for each variable and each first-order difference and report the results in Table 1. From Table 1, we conclude that the time series under investigation are integrated of order one. Indeed, the unit root tests performed on the levels of each variable lead to not rejecting the null hypothesis \(\phi = 0\), suggesting that each time series has a unit root, while the tests on the first-order differences lead to the rejection of the null hypothesis, meaning that stationarity is achieved after applying the first difference operator. In what follows, we discuss the results of the vector error-correction models for each of the two datasets.

Table 1 Unit root tests by means of ADF-GLS

3.3.1 PRC dataset

Having established that all the series involved in the analysis are I(1), here we determine whether there exists a cointegration relation between the variables. The Johansen's tests are based on r, the rank of the matrix \(\varvec{\varPi }\) in Eq. (3). The null hypothesis \(r = 0\) implies no cointegration, while \(r>0\) (\(r=1,\ldots ,d-1\)) means that there are r cointegrating relations. In the latter case, r distinct linear combinations of the variables—the cointegrating vectors—represent the long-run relation between the components of the multivariate time series. Table 2 presents the Johansen's tests on the PRC dataset, where we follow the Box–Jenkins model selection technique to select the optimal order of the VAR model (2), identified to be 8 according to the Bayesian information criterion (BIC). Both the trace test and the maximal eigenvalue test agree in supporting the hypothesis \(r = 2\).

Table 2 Johansen’s cointegration tests for the PRC dataset

Having identified \(r=2\) as the rank of \(\varvec{\varPi }\), we proceed by estimating the vector error-correction model with two cointegrating vectors. Table 3 reports the resulting vector error-correction model for the data breaches of the PRC dataset, the daily number of transactions of Bitcoin and the Bitcoin price, from which we identify both a short-run and a long-run relation between the lagged variables \(\varDelta C_{t}\) and \(\varDelta \lambda ^{\mathrm{PRC}}_{t}\), and a long-run relation between \(\varDelta P_t\) and \(\varDelta \lambda ^{\mathrm{PRC}}_{t}\).
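A minimal sketch of this estimation step with statsmodels' VECM class is given below; the column names, the deterministic specification and the simulated demo data are assumptions made for illustration only and do not reproduce the estimates of Table 3.

```python
# A minimal sketch of the VECM estimation step with statsmodels, assuming a
# hypothetical DataFrame `data` with columns ["C", "P", "lam"] holding the log
# number of Bitcoin transactions, the Bitcoin price and the extrapolated
# intensity. A VAR order of 8 in levels corresponds to k_ar_diff = 7; the
# deterministic specification and the demo data are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import VECM

def fit_vecm(data, rank=2, var_order=8):
    model = VECM(data[["C", "P", "lam"]], k_ar_diff=var_order - 1,
                 coint_rank=rank, deterministic="co")   # constant outside the cointegration relation
    res = model.fit()
    # res.alpha (loadings), res.beta (cointegrating vectors) and res.gamma
    # (short-run coefficients) are the estimated quantities discussed in the text
    return res

# Illustrative call on simulated data (not the paper's datasets)
rng = np.random.default_rng(4)
trend = np.cumsum(rng.normal(size=1500))
data = pd.DataFrame({"C": trend + rng.normal(size=1500),
                     "P": np.cumsum(rng.normal(size=1500)),
                     "lam": trend + rng.normal(size=1500)})
res = fit_vecm(data)
print(res.beta)   # estimated cointegrating vectors
```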

Table 3 Vector error-correction estimation. PRC dataset\(^{\mathrm{a,b}}\)

In Table 3, the underlined coefficients of the vector autoregression highlight the short-run relation between the lagged logarithm of the number of transactions in Bitcoin and the logarithm of the conditional expectations of data breaches. More precisely, almost all the lagged variables \(\varDelta C\) have a strong negative impact on the current conditional expectations of data breaches, as one may observe from the magnitude and the high significance of the regression coefficients. This suggests that the number of transactions in Bitcoin might be a good predictor for data breaches. The intuition behind this result is that hackers prepare themselves to monetize the data attack (either by selling the data or by extorting money from the legitimate data owner) some days in advance, by operating on the Bitcoin market.

Furthermore, when looking at the equation for the Bitcoin transactions in the short-run component of the estimated VECM model, we find that almost all lagged variables \(\varDelta \lambda ^{\mathrm{PRC}}\) are statistically significant in the regression for \(\varDelta C_t\). This result, coupled with the one regarding the equation for \(\varDelta \lambda _t^{\mathrm{PRC}}\), implies that there is a lead–lag relation between data breaches and transactions in Bitcoin, and that the relation is bidirectional. Hence, our results confirm that some of the movements in cryptocurrency markets depend on illegal activities (in our case, cyber attacks). Looking again at Table 3, we find no short-run link between lagged conditional expectations of data breaches and the Bitcoin price. Although the number of transactions has a strong impact on data breaches, this link is not necessarily reflected in a short-run impact on the Bitcoin price.

The long-run relation between the variables under investigation is described by the existence of two cointegrating variables, \(\varPsi ^{\mathrm{PRC},1}_t = P_t - 0.08233 \lambda ^{\mathrm{PRC}}_t\) and \(\varPsi ^{\mathrm{PRC},2}_t = C_t - 55.434 \lambda ^{\mathrm{PRC}}_t\), obtained by the estimation procedure detailed in Sect. 2.2. The double-underlined coefficient in Table 3 highlights that changes of \(\varDelta \lambda _t\) are affected with very high statistical significance by the second cointegrating variable. Since both the logarithm of the number of transactions and the logarithm of the conditional expectations of data breaches enter the second cointegrating vector, we conclude that the short-run impact found above is likely to persist in the long run. Our conjecture about the intuition behind this long-term link between transactions in Bitcoin and data breaches is as follows. On the one hand, once the hackers have monetized the breach, they hold an amount of Bitcoins that will later be used in different contexts. On the other hand, a remunerative data breach creates incentives to prepare more cyber attacks, which in turn create the need for more transactions in Bitcoin. We also find high statistical significance of both cointegrating variables in the equation for the change in the Bitcoin price. This implies that the effects of Bitcoin metrics and conditional expectations of data breaches will also impact Bitcoin returns in the long run. For instance, the negative and significant coefficient associated with \(\varPsi ^1\) indicates that if the linear combination of the Bitcoin price and the logarithm of the conditional expectations of data breaches is positive in one period, the price will fall during the next period to restore equilibrium, and vice versa.

Summarizing the empirical evidence discussed in this section, the change of the expected number of data breaches recorded in the PRC dataset is statistically (and negatively) influenced in the short run by its lagged values \(\varDelta \lambda ^{\mathrm{PRC}}_{t-i}\), \(i=1,\ldots ,3\),Footnote 11 and by the lagged values of the number of transactions in Bitcoin some days before the attack. In the long run, deviations from the cointegration relation, whose components are data breaches and transactions in Bitcoin, cause changes in the data breach intensities.

3.3.2 BLI dataset

The cointegration analysis of the BLI dataset fully confirms the existence of a strong, statistically significant link between data breaches and the number of transactions in Bitcoin, both in the short run and in the long run. Table 4 reports the Johansen's cointegration tests and highlights the existence of two cointegrating vectors. The optimal order for the VAR model is once again 8 and has been selected according to the Box–Jenkins technique. The strong significance of the underlined coefficients in Table 5 indicates the short-term relation. More specifically, the lagged variables \(\varDelta C_{t-1}\), \(\varDelta C_{t-2}\) and \(\varDelta C_{t-6}\) impact heavily on the current value of the logarithmic conditional expectations of data breaches. We also find statistical significance for the reverse relation, especially in the variables \(\varDelta \lambda ^{\mathrm{BLI}}_{t-6}\) and \(\varDelta \lambda ^{\mathrm{BLI}}_{t-7}\). This confirms the intuition provided in the previous section, according to which data breaches have a statistical effect on the number of transactions in Bitcoin. The analysis also confirms the lack of a significant short-term relation between data breaches and the price of Bitcoin.

Table 4 Johansen’s Cointegration Tests for BLI dataset
Table 5 Vector error-correction estimation. BLI dataset\(^{\mathrm{a,b}}\)

The two cointegrating variables \(\varPsi ^{\mathrm{BLI},1}_t = P_t -0.524 \lambda ^{\mathrm{BLI}}_t\) and \(\varPsi ^{\mathrm{BLI},2}_t = C_t - 354.95 \lambda ^{\mathrm{BLI}}_t\), obtained by inserting the estimated cointegrating vectors of Table 5 into the VECM equation (3), describe the long-term relation among the three variables under investigation. We see that changes in the logarithmic expectations of data breaches are due to changes in the lagged logarithms of the number of transactions, to the cointegrating variable itself (double-underlined coefficient in Table 5) and to the lagged logarithms of data breaches (bold coefficients in Table 5). This highlights once again the autoregressive structure of data breaches. As in the PRC dataset, the short-term relation between Bitcoin metrics and data breaches is reflected in a long-run impact on the Bitcoin rate of return.

3.4 Granger causality tests

In this section, we perform a series of Granger causality tests (Granger 1969, 1988; Sims 1972) to provide further evidence on the lead–lag relationships between data breaches and Bitcoin metrics. The Granger causality test is a standard tool for uncovering lead–lag relationships among economic variables. With reference to the most successful applications, we mention Chan (1992) and Abhyankar (1998), among others. However, it is worth mentioning that novel methodologies for determining time-dependent lead–lag relations based on optimal thermal paths have appeared recently, as in Meng et al. (2017) and Xu et al. (2017).

Table 6 Granger causality tests for bivariate VARs
Table 7 Granger causality tests for trivariate VARs

We use first differences of the variables \(\left( C_t,P_t,\lambda ^h_t\right) '\) and perform the tests on both bivariate and trivariate VAR models. In the case of bivariate models, testing for instance the null that \(\varDelta C_t\) does not Granger-cause \(\varDelta \lambda _t\) amounts to estimating the VAR comprising the two variables and testing the null that the coefficients associated with the first variable are all zero in the equation for \(\varDelta \lambda _t\), against the alternative that at least one is different from zero. In the case of the trivariate model (like the one in Eq. (3)), we perform the same test, but the VAR model also includes the remaining variable. We choose the order of autoregression according to the Bayesian information criterion. We report the results in Tables 6 and 7. Underlined p values indicate rejection of the null hypothesis and thus Granger causality between the variables under investigation. We observe that the two tables agree in all the cases under consideration but one. Moreover, the results are fully consistent with the VECM models previously considered.
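The sketch below illustrates both variants with statsmodels: grangercausalitytests for the bivariate case and the test_causality method of a fitted VAR for the trivariate case. Column names, the fixed lag order and the white-noise demo data are illustrative assumptions.

```python
# A minimal sketch of the bivariate and trivariate Granger causality tests with
# statsmodels. `d` is a hypothetical DataFrame of first differences with columns
# ["dC", "dP", "dlam"]; the fixed lag order 8 stands in for the BIC-selected one.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.api import VAR

def granger_bivariate(d, caused="dlam", causing="dC", maxlag=8):
    # grangercausalitytests checks whether the SECOND column Granger-causes the first
    res = grangercausalitytests(d[[caused, causing]], maxlag=maxlag, verbose=False)
    return {lag: out[0]["ssr_ftest"][1] for lag, out in res.items()}   # p-value per lag

def granger_trivariate(d, caused="dlam", causing="dC", lags=8):
    var_res = VAR(d[["dC", "dP", "dlam"]]).fit(lags)
    return var_res.test_causality(caused=caused, causing=causing, kind="f").pvalue

# Illustrative call on white noise, where no Granger causality is expected
rng = np.random.default_rng(5)
d = pd.DataFrame(rng.normal(size=(1500, 3)), columns=["dC", "dP", "dlam"])
print(granger_bivariate(d))
print(granger_trivariate(d))
```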

3.5 Impulse response analysis

Having discovered a causal link between Bitcoin metrics and the conditional expectations of data breaches recorded in two different databases, we analyze the impulse response function (IRF) of the estimated VECM to evaluate the response of the conditional expectations of data breaches to unexpected shocks of the Bitcoin-related variables. The possibility of studying impulse responses in the context of cointegration analysis, and the feasibility of combining the two approaches, has been demonstrated by Lütkepohl and Reimers (1992).

Fig. 3 Impulse response point estimates and 95% confidence bands. a: Shock variable \(P_t\), response variable \(\lambda ^{\mathrm{PRC}}_t\). b: Shock variable \(C_t\), response variable \(\lambda ^{\mathrm{PRC}}_t\). c: Shock variable \(P_t\), response variable \(\lambda ^{\mathrm{BLI}}_t\). d: Shock variable \(C_t\), response variable \(\lambda ^{\mathrm{BLI}}_t\)

In what follows, the IRF of variable i to shock j is defined as the sequence of the elements in the ith row and jth column of the matrices \(\{\varvec{\varPhi }_k\}_{k=0,1,\ldots }\). Assuming that the error term in (2)–(3) can be written as a linear combination of mutually uncorrelated shocks with unit variance, i.e., \(\varvec{\varepsilon }_t=\varvec{H}\varvec{u}_t\), these matrices are obtained as \(\varvec{\varPhi }_k=\frac{\partial \varvec{Y}_t}{\partial \varvec{u}_{t-k}} =\frac{\partial \varvec{Y}_t}{\partial \varvec{\varepsilon }_{t-k}}\varvec{H}\). The matrix \(\varvec{H}\) is assumed to be lower triangular, and its estimate is obtained as the Cholesky decomposition of the estimated variance–covariance matrix of \(\varvec{\varepsilon }_t\) (see Lütkepohl 2006, Chapter 9). Choosing \(\varvec{H}\) to be lower triangular implies that, in general, the ordering of the variables in the vector \(\varvec{Y_t}\) matters. Since we are interested in the effects of Bitcoin-related variables on data breaches, we put the latter as the last variable in the ordering, so that in this setting the variable \(\varDelta \lambda _t\) responds instantaneously to shocks associated with the remaining two variables. However, we have verified that, in our case, a different ordering does not substantially change the estimated impulse response functions.
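The sketch below reconstructs these orthogonalized responses directly from the estimated VECM matrices (loadings, cointegrating vectors, short-run coefficients and residual covariance), assuming `res` is the VECMResults object from the earlier sketch; it is an illustrative implementation of the formulas above, not our production code.

```python
# An illustrative reconstruction of the orthogonalized impulse responses from
# an estimated VECM, assuming `res` is the statsmodels VECMResults object of the
# earlier sketch (attributes alpha, beta, gamma, sigma_u). It implements the
# formulas above with Cholesky identification; it is a sketch, not the authors' code.
import numpy as np

def vecm_to_var(res):
    """Level-VAR coefficient matrices A_1, ..., A_k implied by the VECM."""
    d = res.alpha.shape[0]
    Pi = res.alpha @ res.beta.T
    Gammas = [res.gamma[:, i * d:(i + 1) * d] for i in range(res.gamma.shape[1] // d)]
    k = len(Gammas) + 1
    A = [np.zeros((d, d)) for _ in range(k)]
    A[0] = np.eye(d) + Pi + (Gammas[0] if Gammas else np.zeros((d, d)))
    for i in range(1, k - 1):
        A[i] = Gammas[i] - Gammas[i - 1]
    if Gammas:
        A[k - 1] = -Gammas[-1]
    return A

def orthogonalized_irf(res, horizon=120):
    """List of matrices Phi_0 H, ..., Phi_horizon H, with H = chol(sigma_u)."""
    A = vecm_to_var(res)
    d = A[0].shape[0]
    H = np.linalg.cholesky(res.sigma_u)
    Phi = [np.eye(d)]
    for h in range(1, horizon + 1):
        Phi.append(sum(Phi[h - j] @ A[j - 1] for j in range(1, min(h, len(A)) + 1)))
    return [P @ H for P in Phi]

# Example (with `res` from the VECM sketch): response of the third variable,
# the breach intensity in our ordering, to a shock in the first after 10 days
# irf = orthogonalized_irf(res); print(irf[10][2, 0])
```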

Figure 3 reports point estimates and 95% confidence intervals of the response of \(\lambda ^h\), \(h = \text {PRC, BLI}\), to exogenous one-standard-deviation shocks of the Bitcoin price and of the number of transactions, over a period of 120 days. The impact of the \(C_t\) variable seems stronger in the short run (up to about two weeks), especially for the BLI dataset. For both datasets, the impulse response estimates associated with the \(C_t\) variable cross the zero axis more often than those associated with the Bitcoin price. More specifically, Panels (a) and (b) refer to changes of the PRC data breaches, while Panels (c) and (d) refer to changes of the BLI data breaches due to shocks of the Bitcoin-related variables. We note that the logarithm of the number of transactions in Bitcoin has a relevant impact on future data breaches in both datasets. The size of the response is significantly different from zero, and the phenomenon persists in the long run.

The confidence intervals associated with the point estimates are very tight in the BLI dataset, indicating low variability in the estimates, and less tight in the PRC dataset. The response to unexpected shocks of the Bitcoin price is less relevant, at least up to 15 days. Since after about 15 days the confidence interval lies above the zero line, the impact of a shock of the Bitcoin price on data breaches becomes positive and significant in the case of the PRC dataset. On the other hand, a shock of the Bitcoin price has a negative and significant impact after about 15 days on the data breaches from the BLI dataset. The effect of a shock of the Bitcoin price starts declining after 10 days and almost vanishes by the 30th day in both datasets.

3.6 Variance decomposition of forecasting errors

In this section, we offer further details on the contribution of each Bitcoin-related variable to the forecast power of the estimated VAR model.Footnote 12

Table 8 Variance decomposition of forecast errors of data breaches intensities

The variance of forecast error after h steps for variable i, \(\omega _{h;i}^2\), is defined as the element in position i in the main diagonal of the matrix \(\sum _{k=0}^h \varvec{\varPhi }_k\varvec{\varPhi }_k'\). The contribution of variable j to the variance of forecast error after h steps for variable i is calculated as \(\mathrm {VDFE}_{h;i,j}=\frac{\sum _{k=0}^h(\phi _{k;i,j})^2}{\omega _{h;i}^2}\), where \(\phi _{k;i,j}\) is the element in the ith row and jth column of matrix \(\varvec{\varPhi }_k\).
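This quantity can be computed directly from the orthogonalized IRF matrices; the short sketch below implements the formula, reusing the (hypothetical) list of matrices produced by the impulse response sketch in Sect. 3.5.

```python
# A short sketch implementing the decomposition above on the list of
# orthogonalized IRF matrices produced by the previous sketch (irf[k] playing
# the role of Phi_k); names and indices are illustrative assumptions.
import numpy as np

def vdfe(Phi, i, j, h):
    """Share of the h-step forecast error variance of variable i due to shock j."""
    omega2 = sum(P @ P.T for P in Phi[:h + 1])[i, i]   # total h-step error variance of variable i
    contrib = sum(P[i, j] ** 2 for P in Phi[:h + 1])   # part attributable to shock j
    return contrib / omega2

# Example: contribution of the Bitcoin-price shock (column 1) to the 30-day
# forecast error variance of the breach intensity (row 2), given `res` as above
# print(vdfe(orthogonalized_irf(res, horizon=30), i=2, j=1, h=30))
```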

Table 8 presents the variance decomposition of forecasting errors (VDFE) associated with the VECM estimations presented in Tables 3 and 5. VDFE is a classical tool used by econometricians to understand the impact of exogenous shocks of a given (independent) variable on the forecasting errors of a different (dependent) variable. In other words, VDFE helps us understand the contribution of Bitcoin-related variables in the forecasting power of our model for data breaches.

Looking at the results presented in Table 8, we see that, in the PRC dataset, the contribution of the number of transactions in Bitcoin in explaining the variance of the forecasting errors of data breaches is irrelevant when the forecasting horizon is short (up to 5 days). However, for forecasting horizons greater than 5 days, an exogenous shock in this variable is able to explain up to 4% of the variance of the forecasting errors. The contribution of the Bitcoin price seems to be even more important, as it can explain up to 7.3% of the total variance. In total, for the PRC dataset, Bitcoin metrics are able to explain more than 11.3% of the total variance of the forecasting errors. The VDFE of the BLI dataset displays a reduced relevance of Bitcoin metrics. The number of transactions in Bitcoin is able to explain at most 1.1% of the entire variance, and the Bitcoin price only up to 0.47%.

4 Concluding remarks

In this paper, we uncover a strong, bidirectional relation between data breaches and Bitcoin-related variables. Our analysis suggests that, in the short run, the lagged values of the number of transactions in Bitcoin have a strong negative impact on data breaches today. In the long run, the existence of a cointegrating vector including all variables under investigation implies that the short-run relation will persist. Moreover, we find almost identical results on two different datasets, confirming the robustness of our results. The impulse response analyses highlight the relevant quantitative impact of both the number of transactions and the Bitcoin price on future data breaches, while the variance decomposition of the forecasting errors suggests that these variables can explain up to 5% of the variability.

Our results might open up new research directions. First, on the econometric side, a deeper understanding of the relation between cyber risk and cryptocurrencies might be helpful. Indeed, one might wonder whether the relations found in this paper extend to other classes of cyber risk and to different cryptocurrencies. With the availability of new datasets of cyber attacks, such analyses are going to become feasible in the near future. Second, on the actuarial side, the next step is to create a risk model for data breaches which includes exogenous factors such as cryptocurrency-related variables. This might improve the forecasting procedures and give insurers a better understanding of cyber risk. An analysis of the impact of our findings on classical actuarial measures might also be of great interest. Finally, the relevance of cryptocurrencies in an international financial context is increasing. In this scenario, understanding the connections among cryptocurrencies, macroeconomic variables and other factors such as data breaches is surely an interesting point to explore further.