Estimation of high-dimensional integrated covariance matrix based on noisy high-frequency data with multiple observations
Introduction
The estimation of the population covariance matrix is a fundamental problem in statistics. It is well known that, even with i.i.d. (independent and identically distributed) samples, the sample covariance matrix is not a consistent estimator in the high-dimensional setting, where the dimension and the sample size go to infinity proportionally. Hence, a large number of studies have addressed this problem. Another challenge in estimating the covariance matrix is that the i.i.d. condition may be too strong for practical use, especially in financial applications. For example, with the rapid development of computer science, tick-by-tick high-frequency data have become increasingly available. It is commonly assumed that the latent log price process, which is denoted by , follows the diffusion model where is a -dimensional log price process, is a -dimensional drift process, is a matrix, which is called the covolatility process, and is a -dimensional standard Brownian motion. The interval is the period of interest, say, one trading day. The covariation of interest between asset returns is the so-called integrated covariance (ICV) matrix, which is defined as . The ICV matrix plays a crucial role in financial applications, such as portfolio optimization and risk management. Motivated by the wide applicability of high-frequency data, in this paper we consider the estimation of the ICV matrix in the high-dimensional setting.
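For reference, a diffusion model and ICV matrix of the kind described above are commonly written as follows. The symbols $X_t$, $\mu_t$, $\Theta_t$, $W_t$, $p$ and the interval $[0,1]$ are assumed notation, because the original symbols did not survive in this text; only the roles of the terms are taken from the description, and the covolatility matrix is assumed to be $p \times p$.

```latex
% Assumed notation (not recovered from the source):
%   X_t      : p-dimensional latent log price process
%   \mu_t    : p-dimensional drift process
%   \Theta_t : p x p covolatility process
%   W_t      : p-dimensional standard Brownian motion
\[
  dX_t = \mu_t \, dt + \Theta_t \, dW_t , \qquad t \in [0,1],
\]
\[
  \mathrm{ICV} = \int_0^1 \Theta_t \Theta_t^{\top} \, dt .
\]
```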
The estimation of the ICV matrix is very difficult. The first difficulty is high dimensionality. When the dimension is of the same order of magnitude as the sample size , it is impossible to estimate free parameters from a data set of order . Hence, a special structure is usually assumed on the covariance matrix, such as sparsity (e.g. Fan et al., 2013). When no particular structure is imposed, random matrix theory is an effective tool for analyzing high-dimensional covariance matrices (see Bai and Silverstein, 2010 for details). The second difficulty is microstructure noise. In practice, the observed high-frequency data are always contaminated by market microstructure noise, which is induced by various frictions in the trading process. The accumulated microstructure noise badly affects statistical inference about the latent price process. Hence, the estimator of the ICV matrix has to be built from the noisy observations. The third difficulty is multiple transactions. With tick-by-tick transaction data, there is often more than one record within one recording time interval (see Figure 1 in the supplement for details). In the presence of multiple transactions, one issue is that the order of transactions within each recording time interval is not available or is incorrectly recorded; another issue is asynchronous trading, which means that different stocks are not traded synchronously during each time interval.
In this paper, we consider the estimation of the high-dimensional ICV matrix based on noisy high-frequency data with multiple transactions. Using random matrix theory, we propose a nonlinear shrinkage estimator of the ICV matrix that retains the eigenvectors and nonlinearly shrinks the eigenvalues of a generalized sample covariance matrix based on self-normalized returns. We show that the proposed estimator has two desirable properties: it eliminates the impact of both microstructure noise and multiple transactions, and its limiting nonlinear shrinkage function depends solely on the limiting spectral distribution of the generalized sample covariance matrix. For financial applications, we further prove that the proposed estimator is asymptotically optimal for portfolio selection.
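The general shape of an estimator that keeps the eigenvectors and shrinks the eigenvalues can be sketched as below. This is only an illustration under stated assumptions, not the estimator proposed in the paper: the function names, the `p / n` scaling of the self-normalized matrix, and the `toward_mean` shrinkage rule are all hypothetical placeholders, and the handling of microstructure noise and multiple transactions is omitted entirely.

```python
import numpy as np


def self_normalized_returns(returns):
    """Scale each return vector (row) to unit Euclidean norm; zero rows stay zero."""
    norms = np.linalg.norm(returns, axis=1, keepdims=True)
    safe_norms = np.where(norms > 0, norms, 1.0)
    return returns / safe_norms


def shrink_eigenvalues(sample_cov, shrink_fn):
    """Keep the eigenvectors of sample_cov and replace its eigenvalues by shrink_fn(eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    new_vals = shrink_fn(eigvals)
    # V diag(new_vals) V^T, using broadcasting to scale the columns of V.
    return (eigvecs * new_vals) @ eigvecs.T


def toward_mean(eigvals, weight=0.5):
    """Hypothetical placeholder rule: pull every eigenvalue toward the average eigenvalue."""
    return (1.0 - weight) * eigvals + weight * eigvals.mean()


# Simulated returns, purely for illustration (n observations of p assets).
rng = np.random.default_rng(0)
n, p = 200, 50
returns = 0.01 * rng.standard_normal((n, p))

u = self_normalized_returns(returns)
sample_cov = (p / n) * u.T @ u  # assumed scaling, so that the trace equals p
estimate = shrink_eigenvalues(sample_cov, toward_mean)
```

Any eigenvalue map can be passed as `shrink_fn`; in the paper that role is played by a nonlinear function determined by the limiting spectral distribution, which this sketch does not attempt to reproduce.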
Notation. For any matrix , we use to denote its Frobenius norm. For any Hermitian matrix , the empirical spectral distribution (ESD) is defined as , where are the eigenvalues of and denotes the indicator function. The limit of the ESD as is referred to as the limiting spectral distribution (LSD), if it exists. The Stieltjes transform of a function of bounded variation is defined by , where denotes the support of the function G. For any vector , stands for its Euclidean norm and is the spectral norm of . Let be the diagonal matrix obtained by setting the off-diagonal entries of to zero.
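The quantities named in the notation paragraph can be computed numerically as in the following sketch. The function names are hypothetical, and the Stieltjes transform is assumed to use the common convention of integrating 1/(λ − z) against the distribution, evaluated at a complex point z away from the spectrum; the original formulas did not survive in this text.

```python
import numpy as np


def esd(matrix, x):
    """Empirical spectral distribution at x: fraction of eigenvalues that are <= x."""
    eigvals = np.linalg.eigvalsh(matrix)
    return np.mean(eigvals <= x)


def stieltjes_esd(matrix, z):
    """Stieltjes transform of the ESD at z (assumed convention): average of 1 / (lambda_i - z)."""
    eigvals = np.linalg.eigvalsh(matrix)
    return np.mean(1.0 / (eigvals - z))


a = np.diag([1.0, 2.0, 3.0, 4.0])
b = np.array([[2.0, 1.0], [1.0, 3.0]])

half = esd(a, 2.5)                      # two of the four eigenvalues are <= 2.5
m = stieltjes_esd(a, 1j)                # evaluated at z = i, off the real spectrum
frobenius = np.linalg.norm(a, "fro")    # Frobenius norm
spectral = np.linalg.norm(a, 2)         # spectral norm (largest singular value)
diag_part = np.diag(np.diag(b))         # off-diagonal entries of b set to zero
```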
Observations and main results
In practice, instead of the latent log price process , we observe contaminated data , where is given in (1) and denotes the noise process. In the presence of multiple transactions, denotes the number of transactions for the th stock at recording time , for and . For any process (which can be either , or ), suppose that is the th observation for the th stock during time interval with , and . The
Proofs
Proof of Theorem 1. Let be the LSD of , be the corresponding Stieltjes transform, for all and . Define , where is the eigenvector corresponding to the th largest eigenvalue of . The existence of the above functions is established in Theorem 2.3 of Wang et al. (2019) and Lemma 6.1 of Bai and Silverstein (2010). First, we show that the results in Theorem 1 hold when the notations based on are
Acknowledgments
The authors thank Yimin Xiao (the editor), the associate editor, and the anonymous referees for their helpful comments that improved the article significantly. Wang’s work is supported by the National Natural Science Foundation of China (11871322) and the Shanghai University of Finance and Economics Graduate Innovation Program Project Research Innovation Fund (CXJI-2018-411). Xia’s research is supported by the National Natural Science Foundation of China Grant 11871322.
References (9)
- et al. (2009). Microstructure noise in the continuous case: the pre-averaging approach. Stochastic Process. Appl.
- Bai and Silverstein (2010). Spectral Analysis of Large Dimensional Random Matrices.
- Fan et al. (2013). Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc.
- et al. (2011). Eigenvectors of some large sample covariance matrix ensembles. Probab. Theory Related Fields