
Consistent multiple changepoint estimation with fused Gaussian graphical models


Abstract

We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time-series. Traditionally, estimation of Gaussian graphical models (GGM) is performed in an i.i.d. setting. More recently, such models have been extended to allow for changes in the distribution, but primarily where changepoints are known a priori. In this work, we study the Group-Fused Graphical Lasso (GFGL), which penalises partial correlations with an \(\ell _{1}\) penalty while simultaneously inducing block-wise smoothness over time in order to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints and the structure of the graphical models in each segment. We contrast our results, which are based on a global, i.e. graph-wide, likelihood, with those previously obtained for dynamic graph estimation at a node-wise (or neighbourhood) level.


Notes

  1. For a reference on static graphical models the reader is directed to Lauritzen (1996).

References

  • Angelosante, D., Giannakis, G. B. (2011). Sparse graphical modeling of piecewise-stationary time-series. In: International conference on acoustics, speech and signal processing (ICASSP) (pp. 1960–1963).

  • Bai, J. (1997). Estimation of a change point in multiple regression models. The Review of Economics and Statistics, 79(4), 551–563.

  • Banerjee, O., El Ghaoui, L., d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.

  • Cai, T., Liu, W., Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.

  • Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. Annals of Statistics, 42(6), 2243–2281.

  • Gibberd, A. J., Nelson, J. D. B. (2014). High dimensional changepoint detection with a dynamic graphical lasso. In: IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2684–2688).

  • Gibberd, A. J., Nelson, J. D. B. (2017). Regularized estimation of piecewise constant Gaussian graphical models: The group-fused graphical lasso. Journal of Computational and Graphical Statistics, 26(3), 623–634.

  • Hallac, D., Park, Y., Boyd, S., Leskovec, J. (2017). Network inference via the time-varying graphical lasso. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD).

  • Harchaoui, Z., Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492), 1480–1493.

  • Hinkley, D. V. (1970). Inference about the change point in a sequence of random variables. Biometrika, 57(1), 1–17.

  • Killick, R., Fearnhead, P., Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590–1598.

  • Kolar, M., Xing, E. P. (2012). Estimating networks with jumps. Electronic Journal of Statistics, 6, 2069–2106.

  • Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.

  • Lee, S., Seo, M. H., Shin, Y. (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 193–210.

  • Leonardi, F., Bühlmann, P. (2016). Computationally efficient change point detection for high-dimensional regression (pp. 1–32). arXiv:1601.03704.

  • Meinshausen, N. (2008). A note on the Lasso for Gaussian graphical model selection. Statistics & Probability Letters, 78, 880–884.

  • Meinshausen, N., Bühlmann, P. (2006). High dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462.

  • Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.

  • Monti, R. P., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, S., Montana, G. (2014). Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103, 427–443.

  • Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.

  • Raimondo, M. (1998). Minimax estimation of sharp change points. Annals of Statistics, 26(4), 1379–1397.

  • Ravikumar, P., Wainwright, M. J., Raskutti, G., Yu, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5, 935–980.

  • Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494–515.

  • Roy, S., Atchadé, Y., Michailidis, G. (2016). Change point estimation in high dimensional Markov random-field models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1187–1206.

  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
Author information

Correspondence to A. Gibberd.

Appendices

Appendix 1: Proof of changepoint consistency

We reduce the problem of bounding the maximum deviation between estimated and true changepoints to bounding the probability of an individual changepoint breaking the bound. Following Harchaoui and Lévy-Leduc (2010), we utilise the union bound

$$\begin{aligned} P\left[ \max _{k\in [K]}|\tau _{k}-\hat{\tau }_{k}|\ge T\delta _{T}\right] \le \sum _{k\in [K]}P\left[ |\tau _{k}-\hat{\tau }_{k}|\ge T\delta _{T}\right] . \end{aligned}$$

The complement of the event on the LHS is precisely the target of the proof; we wish to demonstrate \(P[\max _{k}|\tau _{k}-\hat{\tau }_{k}|\le T\delta _{T}]\rightarrow 1\). To show this, we need the LHS above to go to zero as \(T\rightarrow \infty \). It is sufficient, via the union bound, to demonstrate that the probability of the bad events:

$$\begin{aligned} A_{T,k}:=\left\{ |\tau _{k}-\hat{\tau }_{k}|>T\delta _{T}\right\} , \end{aligned}$$
(7)

go to zero for all \(k\in [K]\). The strategy presented here separates the probability of \(A_{T,k}\) occurring across complementary events. In particular, let us construct what can be thought of as a good event, where the estimated changepoints are within a region of the true ones:

$$\begin{aligned} C_{T}:=\left\{ \underset{k\in [K]}{\max }|\hat{\tau }_{k}-\tau _{k}|<d_{\min }/2\right\} . \end{aligned}$$
(8)

The task is then to show that \(P[A_{T,k}]\rightarrow 0\) by showing \(P[A_{T,k}\cap C_{T}]\rightarrow 0\) and \(P[A_{T,k}\cap C_{T}^{c}]\rightarrow 0\) as \(T\rightarrow \infty \).

1.1 Stationarity-induced bounds

As a first step, let us introduce some bounds, based on the optimality conditions, which hold with probability one. We base our results on a set of events which occur in relation to these conditions. By intersecting these events with \(A_{T,k}\cap C_{T}\) and \(A_{T,k}\cap C_{T}^{c}\), we can construct an upper bound on the probability of the changepoint error exceeding a level \(T\delta _{T}\).

Without loss of generality, consider the optimality equations (Lemma 1) with changepoints \(l=\tau _{k}\) and \(l=\hat{\tau }_{k}\) such that \(\hat{\tau }_{k}<\tau _{k}\). We note that an argument for the reverse situation \(\hat{\tau }_{k}>\tau _{k}\) follows through symmetry. Taking the difference between the equations, we find

$$\begin{aligned} \bigg \Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\hat{\Sigma }^{(t)}-\Sigma _{0}^{(t)})-\sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}+\lambda _{1}\sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}\hat{R}_{1}^{(t)}\bigg \Vert _{F}\le 2\lambda _{2}. \end{aligned}$$
(9)

The gradient from the \(\ell _{1}\) term \(\sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}\lambda _{1}\hat{R}_{1}^{(t)}\) can attain a maximum value of \(\pm \lambda _{1}(\tau _{k}-\hat{\tau }_{k})\) for each entry in the precision matrix. Transferring this to the RHS and splitting the LHS into the stochastic and estimated terms, we obtain

$$\begin{aligned} {\bigg \Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\hat{\Sigma }^{(t)}-\Sigma ^{(t)})\bigg \Vert _{F}}-\bigg \Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}\bigg \Vert _{F}\le 2\lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k}). \end{aligned}$$
(10)

The next step is to replace the time-indexed covariance matrices \(\Sigma ^{(t)}=(\Theta ^{(t)})^{-1}\) with the block-indexed covariance matrices \(\Sigma _0^{(k)}\) and \(\Sigma _0^{(k+1)}\). We can re-express the difference in covariance matrices as the sum of the difference between true values either side of \(\tau _{k}\), i.e. \(\Sigma _0^{(k+1)}-\Sigma _0^{(k)}\), and the difference between the (\(k+1\))st true block and the estimated block, i.e. \(\hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)}\), to obtain:

$$\begin{aligned} \lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})&\ge {} \underset{\Vert R_{1}\Vert _{F}}{\underbrace{\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\Sigma _0^{(k)}-\Sigma _0^{(k+1)})\Vert _{F}}}-\underset{\Vert R_{2}\Vert _{F}}{\underbrace{\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)})\Vert _{F}}}\nonumber \\&\quad -\,\underset{\Vert R_{3}\Vert _{F}}{\underbrace{\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}\Vert _{F}}}, \end{aligned}$$
(11)

which holds with probability one. Define the events:

$$\begin{aligned} E_{1}&:=\{\lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})\ge \frac{1}{3}\Vert R_{1}\Vert _{F}\}\\ E_{2}&:=\{\Vert R_{2}\Vert _{F}\ge \frac{1}{3}\Vert R_{1}\Vert _{F}\}\\ E_{3}&:=\{\Vert R_{3}\Vert _{F}\ge \frac{1}{3}\Vert R_{1}\Vert _{F}\} \end{aligned}$$

Since the bound (11) holds with probability one, the union of these three events must also occur with probability one, i.e. \(P[E_{1}\cup E_{2}\cup E_{3}]=1\); if none of \(E_{1},E_{2},E_{3}\) held, summing the three strict inequalities would contradict (11).

1.2 Bounding the good cases

At least one of the three events above must occur. We can thus use this fact to bound the probability of the good \(C_{T}\) and bad \(A_{T,k}\) events occurring jointly. Similar to Harchaoui and Lévy-Leduc (2010) and Kolar and Xing (2012), we obtain

$$\begin{aligned} P[A_{T,k}\cap C_{T}]&\le {} P[\overset{A_{T,k,1}}{\overbrace{A_{T,k}\cap C_{T}\cap E_{1}}}]+P[\overset{A_{T,k,2}}{\overbrace{A_{T,k}\cap C_{T}\cap E_{2}}}]+P[\overset{A_{T,k,3}}{\overbrace{A_{T,k}\cap C_{T}\cap E_{3}}}]. \end{aligned}$$

The following sub-sections describe how to separately bound these sub-events.

Unlike in the work of Kolar and Xing (2012), there is no stochastic element (related to the data \(X_{t}\)) within the first event \(A_{T,k,1}\). We can bound \(P[A_{T,k,1}]\) by considering the event \(\{\frac{1}{3}\Vert R_{1}\Vert _{F}\le \lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})\}\). Given \(\Vert R_{1}\Vert _{F}=\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\Sigma _0^{(k)}-\Sigma _0^{(k+1)})\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\), we therefore obtain the bound

$$\begin{aligned} P[A_{T,k,1}]\le P[(\tau _{k}-\hat{\tau }_{k})\eta _{\min }/3\le \lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})]. \end{aligned}$$

When the events \(C_{T}\) and \(A_{T,k}\) occur, we have \(T\delta _{T}<\tau _{k}-\hat{\tau }_{k}\le d_{\min }/2\); to ensure the event \(A_{T,k,1}\) does not occur, we need:

$$\begin{aligned} \eta _{\min }T\delta _{T}>3\lambda _{2}\quad ;\quad \eta _{\min }>3\lambda _{1}\sqrt{p(p-1)}. \end{aligned}$$
(12)

These conditions are satisfied by Assumption 1. Thus, for a large enough T, we can show that \(P[A_{T,k,1}]=0\); the size of this T depends on the quantities in Eq. (12).
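As a quick sanity check, the inequalities in (12) are easy to evaluate numerically; the following sketch uses purely illustrative values for \(p\), \(T\), \(\delta _{T}\), \(\eta _{\min }\) and the regularisers (none are taken from the paper).

```python
import math

# Illustrative (hypothetical) problem constants.
p = 10                      # dimension
T = 10_000                  # number of time points
delta_T = T ** -0.5         # changepoint localisation rate
eta_min = 1.0               # minimum jump size between blocks
lambda_1 = 0.01             # sparsity regulariser
lambda_2 = 5.0              # smoothing regulariser

# Condition (12): both inequalities must hold for P[A_{T,k,1}] = 0.
cond_smooth = eta_min * T * delta_T > 3 * lambda_2
cond_sparse = eta_min > 3 * lambda_1 * math.sqrt(p * (p - 1))

print(f"eta_min*T*delta_T = {eta_min * T * delta_T:.1f} > 3*lambda_2 = {3 * lambda_2}: {cond_smooth}")
print(f"eta_min = {eta_min} > 3*lambda_1*sqrt(p(p-1)) = {3 * lambda_1 * math.sqrt(p * (p - 1)):.3f}: {cond_sparse}")
```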

Now let us consider the event \(A_{T,k,2}\), together with the quantity \(\bar{\tau }_{k}:=\lfloor (\tau _{k}+\tau _{k+1})/2\rfloor \). On the event \(C_{T}\), we have \(\hat{\tau }_{k+1}>\bar{\tau }_{k}\), so \(\hat{\Sigma }^{(t)}=\hat{\Sigma }^{(k+1)}\) for all \(t\in [\tau _{k},\bar{\tau }_{k}]\). Using the optimality conditions (Lemma 1) with changepoints at \(l=\bar{\tau }_{k}\) and \(l=\tau _{k}\), we obtain

$$\begin{aligned} 2\lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\bar{\tau }_{k}-\tau _{k})\ge \bigg \Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}(\hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)})\bigg \Vert _{F}-\bigg \Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\bigg \Vert _{F}, \end{aligned}$$

and thus

$$\begin{aligned} \Vert \hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)}\Vert _{F}\le \frac{4\lambda _{2}+2\lambda _{1}\sqrt{p(p-1)}(\bar{\tau }_{k}-\tau _{k})+2\Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\Vert _{F}}{\tau _{k+1}-\tau _{k}}. \end{aligned}$$
(13)

We now combine the bounds for events \(E_{1}\) and \(E_{2}\), via \(E_{2}:=\{\Vert R_{2}\Vert _{F}\ge \frac{1}{3}\Vert R_{1}\Vert _{F}\}\) and the bounds \(\Vert R_{1}\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\) and \(\Vert R_{2}\Vert _{F}\le (\tau _{k}-\hat{\tau }_{k})\Vert \hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)}\Vert _{F}\). Substituting in (13), we have

$$\begin{aligned} P[A_{T,k,2}]\le P[E_{2}]=P\left[ \eta _{\min }\le \frac{12\lambda _{2}+6\lambda _{1}\sqrt{p(p-1)}(\bar{\tau }_{k}-\tau _{k})+6\Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\Vert _{F}}{\tau _{k+1}-\tau _{k}}\right] . \end{aligned}$$
(14)

Splitting the probability into three components, we obtain

$$\begin{aligned} P[A_{T,k,2}]\le P[\eta _{\min }d_{\min }\le 12\lambda _{2}]+P[\eta _{\min }\le 3\lambda _{1}\sqrt{p(p-1)}]+P\left[ \eta _{\min }\le \frac{6\Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\Vert _{F}}{\tau _{k+1}-\tau _{k}}\right] . \end{aligned}$$
(15)

Convergence of the first two terms follows as for \(A_{T,k,1}\) (indeed, the second is exactly the condition covered there); however, the third term, equivalently \(\eta _{\min }\le 3\Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\Vert _{F}/(\bar{\tau }_{k}-\tau _{k})\), requires some extra treatment. As \(\bar{\tau }_{k}<\tau _{k+1}\), we can relate the ground-truth covariance matrix (time-indexed) to the block form (indexed by k) such that \(\Sigma ^{(t)}=\Sigma _0^{(k)}\) for all \(t\in [\tau _{k},\tau _{k+1}]\). One can now write the average sampling error across time according to:

$$\begin{aligned} V^{(k)}_{|s^{(k)}|}:=\hat{S}^{(k)}_{|s^{(k)}|}-\Sigma _{0}^{(k)}, \end{aligned}$$

where

$$\begin{aligned} \hat{S}^{(k)}_{|s^{(k)}|}:=\frac{1}{|s^{(k)}|}\sum _{t\in {s^{(k)}}}X^{(t)}(X^{(t)})^{\top } \end{aligned}$$

and \(s^{(k)}\subseteq [\tau _k,\tau _{k+1}]\) is a subset of the kth changepoint interval. To simplify the notation, we will refer to the above quantities with a general subset of data \(n=|s^{(k)}|\); principally, this is because within a block the samples are i.i.d., so only the length n distinguishes the quantity.

Lemma 2

(Error Bound on Empirical Covariance Matrix)

Let \(X^{(t)}\sim {\mathcal {N}}(0,\Sigma _{0}^{(k)})\), or more generally be sub-Gaussian with parameter \(\sigma _{t}\) for \(t=1,\ldots ,n\), then

$$\begin{aligned} P\left( \Vert V^{(k)}_{n}\Vert _{F}\ge \epsilon \right) \le 4p^{2}\exp \left\{ -n\epsilon ^{2}/p^{2}c_{1}\right\} , \end{aligned}$$
(16)

where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all

$$\begin{aligned} \epsilon \le 2^{3}p\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}(1+4\min _{t}\{\sigma _{t}^{2}\})^{2}/(1+4\max _{t}\{\sigma _{t}^{2}\}). \end{aligned}$$

The result is a corollary of relating the empirical covariance under sub-Gaussian sampling to the sum of sub-exponential random variables; see Appendix 3.1 for details.
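To illustrate the concentration behaviour in Lemma 2, a small Monte Carlo sketch (with a hypothetical block covariance and sample sizes) tracks how \(\Vert V^{(k)}_{n}\Vert _{F}\) shrinks as n grows; the \(1/\sqrt{n}\) decay is visible in the output.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
# A simple (hypothetical) block covariance: unit variances, AR(1)-style correlation.
Sigma0 = np.array([[0.5 ** abs(i - j) for j in range(p)] for i in range(p)])

for n in (50, 200, 1000, 5000):
    errs = []
    for _ in range(200):  # Monte Carlo repetitions
        X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
        S_hat = X.T @ X / n          # block empirical covariance \hat{S}^{(k)}_n
        V = S_hat - Sigma0           # sampling error V^{(k)}_n
        errs.append(np.linalg.norm(V, "fro"))
    print(f"n = {n:5d}: mean ||V||_F = {np.mean(errs):.4f}")
```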

Finally, let us turn to \(A_{T,k,3}\). Recall \(P(A_{T,k,3}):=P(A_{T,k}\cap C_{T}\cap E_{3}):=P(A_{T,k}\cap C_{T}\cap \{\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}\Vert _{F}\ge \Vert R_{1}\Vert _{F}/3\})\). Given that \(\Vert R_{1}\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\) with probability 1, an upper bound on \(P[A_{T,k,3}]\) can be found using the same concentration bounds (Lemma 2) as for \(A_{T,k,2}\). The only difference is that we need to replace the integration interval n with \(T\delta _{T}\). Noting that \(T\delta _{T}<\tau _{k}-\hat{\tau }_{k}\le d_{\min }/2\), the overall bound will be dominated by the concentration results requiring \(n>T\delta _{T}\).

1.3 Bounding the bad cases

In order to complete the proof, we need to demonstrate that \(P[A_{T,k}\cap C_{T}^{c}]\rightarrow 0\). Again, the argument below follows that of Harchaoui and Lévy-Leduc (2010), whereby the bad case is split into several events:

$$\begin{aligned} D_{T}^{(l)}:&= {} \left\{ \exists k\in [K],\;\hat{\tau }_{k}\le \tau _{k-1}\right\} \cap C_{T}^{c},\\ D_{T}^{(m)}:&= {} \left\{ \forall k\in [K],\;\tau _{k-1}<\hat{\tau }_{k}<\tau _{k+1}\right\} \cap C_{T}^{c},\\ D_{T}^{(r)}:&= {} \left\{ \exists k\in [K],\;\hat{\tau }_{k}\ge \tau _{k+1}\right\} \cap C_{T}^{c}, \end{aligned}$$

where \(C_{T}^{c}=\{\max _{k\in [K]}|\hat{\tau }_{k}-\tau _{k}|\ge d_{\min }/2\}\) is the complement of the good event. The events above correspond to estimating a changepoint: (a) before the previous true changepoint (\(D_{T}^{(l)}\)); (b) between the previous and next true changepoints (\(D_{T}^{(m)}\)); and (c) after the next true changepoint (\(D_{T}^{(r)}\)). The events \(D_{T}^{(l)}\) and \(D_{T}^{(r)}\) appear to be particularly bad, as the estimated changepoint is very far from the truth; due to symmetry, we can bound these two events in a similar manner. Focussing on the middle term \(P[A_{T,k}\cap D_{T}^{(m)}]\), let us again assume \(\hat{\tau }_{k}<\tau _{k}\); the reverse argument holds by symmetry.

The probability of the intersection of \(A_{T,k}\) and \(D_{T}^{(m)}\) can be bounded from above by considering the events

$$\begin{aligned} E_{k}^{'}&:= {} \{(\hat{\tau }_{k+1}-\tau _{k})\ge d_{\min }/2\},\end{aligned}$$
(17)
$$\begin{aligned} E_{k}^{''}&:= {} \{(\tau _{k}-\hat{\tau }_{k})\ge d_{\min }/2\}. \end{aligned}$$
(18)

In particular, one can demonstrate that:

$$\begin{aligned} P[A_{T,k}\cap D_{T}^{(m)}]\le P[A_{T,k}\cap E_{k}^{'}\cap D_{T}^{(m)}]+\sum _{j=k+1}^{K}P[E_{j}^{''}\cap E_{j}^{'}\cap D_{T}^{(m)}]. \end{aligned}$$
(19)

Let us first assess \(P[A_{T,k}\cap D_{T}^{(m)}\cap E_{k}^{'}]\), and consider the stationarity conditions (10) with start and end points set as \(l=\hat{\tau }_{k},l=\tau _{k}\) and as \(l=\tau _{k},l=\hat{\tau }_{k+1}\). We respectively obtain:

$$\begin{aligned} |\tau _{k}-\hat{\tau }_{k}|\Vert \Sigma _0^{(k)}-\hat{\Sigma }^{(k+1)}\Vert _{F}\le 2\lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})+\bigg \Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}\bigg \Vert _{F}\; \end{aligned}$$
(20)

and

$$\begin{aligned} |\tau _{k}-\hat{\tau }_{k+1}|\Vert \Sigma _0^{(k+1)}-\hat{\Sigma }^{(k+1)}\Vert _{F}\le 2\lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\hat{\tau }_{k+1}-\tau _{k})+\bigg \Vert \sum _{t=\tau _{k}}^{\hat{\tau }_{k+1}-1}W^{(t)}\bigg \Vert _{F}. \end{aligned}$$
(21)

Using the triangle inequality, we bound \(\Vert \Sigma _0^{(k+1)}-\Sigma _0^{(k)}\Vert _{F}\) conditional on \(E_{k}^{'}:=\{(\hat{\tau }_{k+1}-\tau _{k})\ge d_{\min }/2\}\) and \(A_{T,k}:=\{|\tau _{k}-\hat{\tau }_{k}|>T\delta _{T}\}\). Specifically, we construct the event

$$\begin{aligned} H_{T}^{\Sigma }&:= {} \{\Vert \Sigma _0^{(k+1)}-\Sigma _0^{(k)}\Vert _{F}\le 2\lambda _{1}\sqrt{p(p-1)}+2\lambda _{2}((T\delta _{T})^{-1}+2/d_{\min })\nonumber \\&\quad +\Vert V^{(k)}_{\tau _{k}-\hat{\tau }_{k}}\Vert _{F}+\Vert V^{(k+1)}_{\hat{\tau }_{k+1}-\tau _{k}}\Vert _{F}\}, \end{aligned}$$
(22)

which bounds the first term of (19) such that \(P[A_{T,k}\cap E_{k}^{'}\cap D_{T}^{(m)}]\le P[H_{T}^{\Sigma }\cap \{\tau _{k}-\hat{\tau }_{k}\ge T\delta _{T}\}\cap E_{k}^{'}]\). Splitting the intersection of events, we now have five terms to consider:

$$\begin{aligned}&P(A_{T,k}\cap E_{k}^{'}\cap D_{T}^{(m)})\\&\quad \le P(\lambda _{1}\sqrt{p(p-1)}\ge \eta _{\min }/10)+P(\lambda _{2}/T\delta _{T}\ge \eta _{\min }/10)+P(\lambda _{2}/d_{\min }\ge \eta _{\min }/20)\\&\qquad +\,P(\Vert V^{(k)}_{\tau _{k}-\hat{\tau }_{k}}\Vert _{F}\ge \eta _{\min }/5\}\cap \{\tau _{k}-\hat{\tau }_{k}\ge T\delta _{T}\})\\&\qquad +\,P(\{\Vert V^{(k+1)}_{\hat{\tau }_{k+1}-\tau _{k}}\Vert _{F}\ge \eta _{\min }/5\}\cap \{\hat{\tau }_{k+1}-\tau _{k}\ge d_{\min }/2\}). \end{aligned}$$

The stochastic error terms (containing \(V^{(k)}_{\tau _{k}-\hat{\tau }_{k}}\)) can then be shown to converge similarly to \(P(A_{T,k}\cap C_{T})\). Again, it is worth noting that the term involving \(T\delta _{T}\) will be the slowest to converge, as \(d_{\min }=\gamma _{\min }T>\delta _{T}T\) for large T. The first three terms are bounded through the assumptions on \(d_{\min },\lambda _{1},\lambda _{2}\) and \(\delta _{T}\) required by the theorem (and enforce a similar requirement to those used to bound \(P(A_{T,k,1})\) in Eq. (12)). The other terms in (19), i.e. \(\sum _{j=k+1}^{K}P[E_{j}^{''}\cap E_{j}^{'}\cap D_{T}^{(m)}]\), can be bounded similarly: instead of using exactly the event \(H_{T}^{\Sigma }\), one simply replaces the term \((T\delta _{T})^{-1}\) in (22) with \(2/d_{\min }\).

Now let us consider the event \(D_{T}^{(l)}:=\left\{ \exists k\in [K],\;\hat{\tau }_{k}\le \tau _{k-1}\right\} \cap C_{T}^{c}\). The final step of the proof is to show that the bounds on \(A_{T,k}\cap D_{T}^{(l)}\), and similarly \(A_{T,k}\cap D_{T}^{(r)}\), tend to zero. To achieve this, we introduce an upper bound derived from the combinatorics of the estimated changepoints:

Claim

The probability of \(D_{T}^{(l)}\) is bounded by

$$\begin{aligned} P(D_{T}^{(l)})\le 2^{K}\sum _{k=1}^{K-1}\sum _{l\ge k}^{K-1}P(E_{l}^{''}\cap E_{l}^{'})+2^{K}P(E_{K}^{'}). \end{aligned}$$

We omit the proof for brevity; however, we remark that a similar argument is made in Harchaoui and Lévy-Leduc (2010, Eq. 31). In order to bound the above probabilities, we relate the events \(E_{l}^{''}\) and \(E_{l}^{'}\) to the optimality conditions as before, via Eq. (10). Setting \(k=l\) and invoking the triangle inequality gives us (similarly to Eq. 22) the event

$$\begin{aligned} J_{T}^{\Sigma _0}:=\left\{ \Vert \Sigma _0^{(l+1)}-\Sigma _0^{(l)}\Vert _{F}\le 2\lambda _{1}\sqrt{p(p-1)}+M+\Vert V^{(l)}_{\tau _{l}-\hat{\tau }_{l}}\Vert _{F}+\Vert V^{(l+1)}_{\hat{\tau }_{l+1}-\tau _{l}}\Vert _{F}\right\} , \end{aligned}$$

where \(M=2\lambda _{2}(|\tau _{l}-\hat{\tau }_{l}|^{-1}+|\hat{\tau }_{l+1}-\tau _{l}|^{-1})\). Conditioning on the event \(E_{l}^{''}\cap E_{l}^{'}\) implies that \(M\le 8\lambda _{2}/d_{\min }\). We can thus write

$$\begin{aligned} P(E_{l}^{''}\cap E_{l}^{'})&\le {} P(\eta _{\min }\le 8\lambda _{1}\sqrt{p(p-1)})+P(\eta _{\min }\le 32\lambda _{2}/d_{\min })\\&\quad +\,P(\{\Vert V^{(l)}_{\tau _{l}-\hat{\tau }_{l}}\Vert _{F}\ge \eta _{\min }/4\}\cap \{\tau _{l}-\hat{\tau }_{l}\ge d_{\min }/2\})\\&\quad +\,P(\{\Vert V^{(l+1)}_{\hat{\tau }_{l+1}-\tau _{l}}\Vert _{F}\ge \eta _{\min }/4\}\cap \{\hat{\tau }_{l+1}-\tau _{l}\ge d_{\min }/2\}). \end{aligned}$$

Finally, the term corresponding to the last changepoint can be bounded by noting that when \(k=K\) we have \(M\le 6\lambda _{2}/d_{\min }\), and

$$\begin{aligned} P(E_{K}^{''})&\le {} P(\eta _{\min }\le 8\lambda _{1}\sqrt{p(p-1)})+P(\eta _{\min }\le 24\lambda _{2}/d_{\min })\nonumber \\&\quad +\,P(\{\Vert V^{(K)}_{\tau _{K}-\hat{\tau }_{K}}\Vert _{F}\ge \eta _{\min }/4\}\cap \{\tau _{K}-\hat{\tau }_{K}\ge d_{\min }/2\})\nonumber \\&\quad +\,P(\Vert V^{(K+1)}_{T+1-\tau _{K}}\Vert _{F}\ge \eta _{\min }/4). \end{aligned}$$
(23)

1.4 Summary

The bounds derived in Sects. 1.1–1.3 demonstrate that \(P(A_{T,k})\rightarrow 0\), since \(P(A_{T,k}\cap C_{T})\rightarrow 0\) and \(P(A_{T,k}\cap C_{T}^{c})\rightarrow 0\). However, to achieve these bounds, the regularisers must be set appropriately. The event \(E_{l}^{''}\cap E_{l}^{'}\) establishes a minimal condition on T in conjunction with \(\eta _{\min }\) and the regularisers, such that \(\eta _{\min }d_{\min }/\lambda _{2}>32\) and \(\eta _{\min }/\lambda _{1}\sqrt{p(p-1)}>8\). A final condition for \(A_{T,k,1}\) requires \(\eta _{\min }T\delta _{T}/\lambda _{2}>3\). Once T is large enough to satisfy these conditions, the probabilistic bound is determined either by the smallest block size \(d_{\min }=\gamma _{\min }T\) or by the minimum error \(T\delta _{T}\). Let \(k_{\infty }=\arg \max _{k}\{\max _{i}\Sigma _{0;ii}^{(k)}\}\) select the block which results in the largest expected covariance error. Summing the probabilities, one obtains the upper bound:

$$\begin{aligned} P[|\tau _{k}-\hat{\tau }_{k}|\ge T\delta _{T}]&\le 2\times 2^{K}\big ((K-1)^{2}+1\big )P(\Vert V^{(k_{\infty })}_{d_{\min }/2}\Vert _{F}\ge \eta _{\min }/4)\\&\quad+\,2P(\Vert V^{(k_{\infty })}_{T\delta _{T}}\Vert _{F}\ge \eta _{\min }/5)\\&\quad+\,2P(\Vert V^{(k_{\infty })}_{T\delta _{T}}\Vert _{F}\ge \eta _{\min }/3), \end{aligned}$$

where the top row corresponds to \(D_{T}^{(l)}\) and \(D_{T}^{(r)}\); the middle \(D_{T}^{(m)}\), and the bottom \(A_{T,k,2}\) and \(A_{T,k,3}\). Since \(\delta _{T}T<\gamma _{\min }T\) for large T, the above bounds will be dominated by errors \(V^{(k_{\infty })}_{T\delta _{T}}\) integrated over the relatively small distance \(T\delta _{T}\). A suitable overall bound on the probability is

$$\begin{aligned} P(\max _{k\in [K]}|\tau _{k}-\hat{\tau }_{k}|\ge T\delta _{T})&\le K^{3}2^{K+1}P(\Vert V^{(k_{\infty })}_{d_{\min }/2}\Vert _{F}\ge \eta _{\min }/4)\\&\quad+4KP(\Vert V^{(k_{\infty })}_{T\delta _{T}}\Vert _{F}\ge \eta _{\min }/5)\\&\quad\le C_{K}P(\Vert V^{(k_{\infty })}_{T\delta _{T}}\Vert _{F}\ge \eta _{\min }/5). \end{aligned}$$

In the Gaussian case where \(\sigma _{t}=1\) for all t, we have

$$\begin{aligned} P(\max _{k\in [K]}|\tau _{k}-\hat{\tau }_{k}|\ge T\delta _{T})\le C_{K}4p^{2}\exp \{-T\delta _{T}\eta _{\min }^{2}/p^{2}c_{3}\} \end{aligned}$$

where \(C_{K}=K(K^{2}2^{K+1}+4)\) and \(c_{3}=5^{4}2^{7}\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\), valid for all \(\eta _{\min }\le 2^{3}5^{2}p\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\). We thus arrive at the result of Theorem 1.
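To see how this final bound behaves, the short sketch below evaluates \(C_{K}4p^{2}\exp \{-T\delta _{T}\eta _{\min }^{2}/p^{2}c_{3}\}\) as T grows, with an admissible rate \(\delta _{T}=\log T/\sqrt{T}\); all constants are hypothetical placeholders (in particular, \(c_{3}\) stands in for \(5^{4}2^{7}\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\)).

```python
import math

K, p = 3, 10          # hypothetical number of changepoints / dimension
eta_min = 1.0
c3 = 1.0              # placeholder for 5^4 * 2^7 * ||Sigma_0^{(k_inf)}||_inf
C_K = K * (K ** 2 * 2 ** (K + 1) + 4)

for T in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    delta_T = math.log(T) / math.sqrt(T)   # one admissible localisation rate
    bound = C_K * 4 * p ** 2 * math.exp(-T * delta_T * eta_min ** 2 / (p ** 2 * c3))
    # Vacuous (> 1) for small T, but decaying rapidly once T*delta_T >> p^2.
    print(f"T = {T:>7d}: T*delta_T = {T * delta_T:9.1f}, bound = {bound:.3e}")
```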

Appendix 2: Proof of model-selection consistency

Assumption 3

The event \(E_{\tau }:=\{\max _{k}|\hat{\tau }_{k}-\tau _{k}|\le T\delta _{T}\}\) holds with some increasing probability \(1-f_{\tau }(T)\rightarrow 1\) as \(T\rightarrow \infty \).

For the model-selection consistency argument, we extend that presented in the stationary i.i.d. setting by Ravikumar et al. (2011); specifically, we follow the primal-dual witness method as described in Wainwright (2009). Since \(\{\hat{\Theta }^{(k)}\}_{k=1}^{\hat{B}}\) is an optimal solution for GFGL, it needs to satisfy, for each estimated block \(k=1,\ldots ,\hat{B}=K+1\),

$$\begin{aligned} \left\{ \sum _{l\ne k}^{\hat{B}}\hat{n}_{lk}(V^{(l)}_{\hat{n}_{lk}})\right\} +\hat{n}_{kk}V^{(k)}_{\hat{n}_{kk}}-\hat{n}_{k}\hat{\Sigma }^{(k)}+\lambda _{1}\hat{n}_{k}\hat{R}_{1}^{(\hat{\tau }_{k-1})}+\lambda _{2}(\hat{R}_{2}^{(\hat{\tau }_{k-1})}-\hat{R}_{2}^{(\hat{\tau }_{k})})=0, \end{aligned}$$
(24)

where \(\hat{n}_{lk}\) describes the extent of the overlap between the lth true block and the kth estimated block. The term \(\sum _{l\ne k\in [\hat{B}]}\hat{n}_{lk}(V^{(l)}_{\hat{n}_{lk}})\) can be thought of as providing a sampling bias due to estimation error in the changepoints, whereas the term \(\hat{n}_{kk}V^{(k)}_{\hat{n}_{kk}}\) compares samples with the ground truth of the same underlying covariance matrix.

We will now proceed to construct an oracle estimator \(\bar{\Theta }\in {\mathbb {M}}_{\hat{B}}:=\{U^{(k)}\in {\mathbb {R}}^{p\times p}|\,U_{{\mathcal {M}}^{\perp }}^{(k)}=0,U^{(k)}\succ 0\}_{k=1}^{\hat{B}}\). The oracle is constructed through solving the restricted problem

$$\begin{aligned} \bar{\Theta }:&= {} \underset{U\in {\mathbb {M}}_{\hat{B}}}{{\mathrm {arg\,min}}}\bigg [\sum _{k=1}^{\hat{B}}\big \{\sum _{l=1}^{\hat{B}}\hat{n}_{lk}{\mathrm {tr}}(\hat{S}^{(l)}_{\hat{n}_{lk}}U^{(k)})-\hat{n}_{k}\log \det (U^{(k)})\big \}+\lambda _{1}\sum _{k=1}^{\hat{B}}\hat{n}_{k}\Vert U^{(k)}\Vert _{1} \\&\quad +\,\lambda _{2}\sum _{k=2}^{\hat{B}}\Vert U^{(k)}-U^{(k-1)}\Vert _{F}\bigg ]. \end{aligned}$$

The construction above does not utilise oracle knowledge to enforce changepoint positions, only the sparsity structure of the block-wise precision matrices. Again, for each estimated block, we obtain a set of optimality conditions like (24). Let us denote the sub-gradient of the restricted problem evaluated at the oracle solution as \(\bar{R}_{1}^{(k)}\equiv \bar{R}_{1}^{(\hat{\tau }_{k-1})}\) for the \(\ell _{1}\) penalty, and \(\bar{R}_{2}^{(\hat{\tau }_{k-1})},\bar{R}_{2}^{(\hat{\tau }_{k})}\) for the smoothing components. By definition, the matrices \(\bar{R}_{2}^{(\hat{\tau }_{k-1})},\bar{R}_{2}^{(\hat{\tau }_{k})}\) are members of the sub-differential and hence dual feasible. To show that \(\bar{\Theta }\) is also a minimiser of the unrestricted GFGL problem (2), we will show that \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }<1\), so that \(\bar{R}_{1}^{(k)}\) is strictly dual feasible.

Ravikumar et al. (2011, Lemma 4) demonstrate that for the standard graphical lasso problem strict dual feasibility can be obtained by bounding the maxima of both the sampling and estimation error. The estimation error (on the precision matrices) is tracked through the difference (remainder) between the gradient of the log-det loss function and its first-order Taylor expansion. In our case, we will track the precision matrices at each block k via the remainder function defined as

$$\begin{aligned} {\mathcal {E}}(\Delta )=\bar{\Theta }^{-1}-\Theta _{0}^{-1}+\Theta _{0}^{-1}\Delta \Theta _{0}^{-1}, \end{aligned}$$
(25)

where \(\Delta =\bar{\Theta }-\Theta _{0}\in {\mathbb {R}}^{p\times p}\).

Lemma 3

(Dual Feasibility) The out-of-subspace parameters are dual feasible such that \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }<1\) if

$$\begin{aligned} \max \left\{ \Vert \tilde{V}^{(k)}\Vert _{\infty },\Vert {\mathcal {E}}(\Delta )\Vert _{\infty },\lambda _{2}\hat{n}_{k}^{-1}\Vert \bar{R}_{2}^{(\hat{\tau }_{k-1})}\Vert _{\infty },\lambda _{2}\hat{n}_{k}^{-1}\Vert \bar{R}_{2}^{(\hat{\tau }_{k})}\Vert _{\infty }\right\} \le \alpha \lambda _{1}/16, \end{aligned}$$
(26)

where

$$\begin{aligned} \tilde{V}^{(k)}:=\hat{n}_{k}^{-1}\left( \sum _{l\ne k}^{\hat{B}}\hat{n}_{lk}V^{(l)}_{\hat{n}_{lk}}+\hat{n}_{kk}V^{(k)}_{\hat{n}_{kk}}\right) . \end{aligned}$$
(27)

We note at this point that condition (26), in the setting where \(T\rightarrow \infty \), converges to that of the standard graphical lasso (Ravikumar et al. 2011). Specifically, if the changepoint error is bounded according to the event \(E_{\tau }:=\{\max _{k}|\hat{\tau }_{k}-\tau _{k}|\le T\delta _{T}\}\), the mis-specification error averaged across the block converges to the exact case, \(\tilde{V}^{(k)}\rightarrow V_{\mathrm {exact}}^{(k)}\), where exact refers to the setting with zero changepoint estimation error.
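The overlap-weighted error \(\tilde{V}^{(k)}\) in (27) can be made concrete with a small simulation; the changepoints, covariances and offset below are hypothetical, chosen only to show how the weights \(\hat{n}_{lk}\) arise from the overlap of estimated and true blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
tau_true = [0, 100, 250]              # true changepoints (hypothetical)
tau_hat = [0, 110, 250]               # estimated changepoints, 10 samples off
Sigmas = [np.eye(p), 2 * np.eye(p)]   # true block covariances (hypothetical)

# Draw the full sample path, block by true block.
X = np.vstack([rng.multivariate_normal(np.zeros(p), S, size=b - a)
               for (a, b, S) in zip(tau_true[:-1], tau_true[1:], Sigmas)])

k = 0                                 # examine the first estimated block
a, b = tau_hat[k], tau_hat[k + 1]
n_k = b - a
# Overlap lengths \hat{n}_{lk} of estimated block k with each true block l.
overlaps = [max(0, min(b, t1) - max(a, t0))
            for (t0, t1) in zip(tau_true[:-1], tau_true[1:])]
S_hat_k = X[a:b].T @ X[a:b] / n_k
# \tilde{V}^{(k)}: deviation from the overlap-weighted true covariances, as in (27).
V_tilde = S_hat_k - sum(n_lk * S for (n_lk, S) in zip(overlaps, Sigmas)) / n_k
print("overlaps:", overlaps, " ||V_tilde||_inf =", np.abs(V_tilde).max().round(4))
```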

Lemma 4

The average sampling error over estimated block k is bounded for some \(\epsilon <c_{5}:=2^{3}5\Vert \Sigma ^{(k_{\infty })}_0\Vert _{\infty }\) according to

$$\begin{aligned} P[\Vert \tilde{V}^{(k)}\Vert _{\infty }>\epsilon ]\le 4p^{2}e^{-\epsilon ^{2}\hat{n}_{k}/c_{4}}, \end{aligned}$$

for \(c_{4}=2^{7}5^{2}\Vert \Sigma ^{(k_{\infty })}_0\Vert _{\infty }\) .

Applying this result to (26) and making the choice of regulariser \(\lambda _{1}=16\alpha ^{-1}\epsilon \) enables the condition \(\Vert \tilde{V}^{(k)}\Vert _{\infty }\le \alpha \lambda _{1}/16\) in (26) to be satisfied with high probability. Specifically, we have

$$\begin{aligned} P[\Vert \tilde{V}^{(k)}\Vert _{\infty }>\alpha \lambda _{1}2^{-4}]\le f_{V}(T), \end{aligned}$$

where \(f_{V}(T)=4p^{2}\exp \{-c_{4}^{-1}\epsilon ^{2}(\gamma _{\min }-2\delta _{T})T\}\), and note that \(f_{V}(T)\rightarrow 0\) as \(T\rightarrow \infty \).

We now turn our attention to the size of the remainder \(\Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\). In the first step, we directly invoke a result from Ravikumar et al. (2011):

Lemma 5

If the bound \(\Vert \Delta \Vert _{\infty }\le (3K_{\Sigma _{0}}d)^{-1}\) holds, where d is the maximum node degree, then

$$\begin{aligned} \Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\le \frac{3}{2}d\Vert \Delta \Vert _{\infty }^{2}K_{\Sigma _{0}}^{3}. \end{aligned}$$

While we can use the same relation as Ravikumar et al. (2011) to map \(\Vert \Delta \Vert _{\infty }\) to \(\Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\), we need to modify the argument for the actual control of \(\Vert \Delta \Vert _{\infty }\).
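Lemma 5 is straightforward to check numerically; a minimal sketch with a small, hypothetical sparse precision matrix and perturbation \(\Delta \) follows, comparing \(\Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\) from (25) against the stated bound.

```python
import numpy as np

p, d = 6, 2                                  # dimension and (hypothetical) max node degree
Theta0 = np.eye(p)
Theta0[0, 1] = Theta0[1, 0] = 0.3            # a sparse, positive-definite precision matrix
Sigma0 = np.linalg.inv(Theta0)
K_Sigma = np.abs(Sigma0).sum(axis=1).max()   # infinity-operator norm of Sigma_0

Delta = np.zeros((p, p))
Delta[0, 1] = Delta[1, 0] = 1e-3             # small symmetric perturbation
assert np.abs(Delta).max() <= 1 / (3 * K_Sigma * d)   # hypothesis of Lemma 5

# Remainder of the first-order expansion of the inverse map, Eq. (25).
E = np.linalg.inv(Theta0 + Delta) - Sigma0 + Sigma0 @ Delta @ Sigma0
lhs = np.abs(E).max()
rhs = 1.5 * d * np.abs(Delta).max() ** 2 * K_Sigma ** 3
print(f"||E(Delta)||_inf = {lhs:.3e} <= (3/2) d ||Delta||_inf^2 K_Sigma^3 = {rhs:.3e}")
```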

Lemma 6

The elementwise \(\ell _{\infty }\) norm of the error is bounded such that \(\Vert \bar{\Delta }\Vert _{\infty }=\Vert \bar{\Theta }-\Theta _{0}\Vert _{\infty }\le r\) if

$$\begin{aligned} r:=2K_{\Gamma _{0}}\{\Vert \tilde{V}^{(k)}\Vert _{\infty }+\lambda _{1}+\lambda _{2}\hat{n}_{k}^{-1}(\Vert \bar{R}_{2}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2}^{(\hat{\tau }_{k})}\Vert _{\infty })\}, \end{aligned}$$
(28)

and \(r\le \min \{(3K_{\Sigma _{0}}d)^{-1},(3K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}d)^{-1}\}\).

Note that the contribution of the fused sub-gradient is bounded as \(\lambda _{2}\hat{n}_{k}^{-1}(\Vert \bar{R}_{2}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2}^{(\hat{\tau }_{k})}\Vert _{\infty })\le 2\lambda _{2}\hat{n}_{k}^{-1}\). Let us further assume that \(\lambda _{2}=\lambda _{1}\rho \) for some \(\rho >0\); we can now upper bound (28) with the stated form of \(\lambda _{1}\) such that

$$\begin{aligned} r\le r_{E_{V}}:=2K_{\Gamma _{0}}\{\epsilon +\lambda _{1}(1+2\rho \hat{n}_{k}^{-1})\}=2K_{\Gamma _{0}}z_{\hat{n}_{k}}\epsilon , \end{aligned}$$

where

$$\begin{aligned} z_{\hat{n}_{k}}:=1+16\alpha ^{-1}(1+2\rho \hat{n}_{k}^{-1}). \end{aligned}$$

We have two constraints on \(\epsilon \): one from the concentration bound (Lemma 4), whereby \(\epsilon \le 2^{3}5\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\); a second, from Lemma 6, gives

$$\begin{aligned} r\le r_{E_{V}}\le \min \{(3K_{\Sigma _{0}}d)^{-1},(3K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}d)^{-1}\} \end{aligned}$$

which implies \(\epsilon \le 1/v_{E_{V}}\) where \(v_{E_{V}}:=6dz_{\hat{n}_{k}}\max \{K_{\Sigma _{0}}K_{\Gamma _{0}},K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}^{2}\}\). Let us now use Lemma 5 to obtain

$$\begin{aligned} \Vert {\mathcal {E}}(\Delta )\Vert _{\infty }&\le {} \frac{3}{2}d\Vert \Delta \Vert _{\infty }^{2}K_{\Sigma _{0}}^{3}\\&\le {} 6dK_{\Gamma _{0}}^{2}K_{\Sigma _{0}}^{3}z_{\hat{n}_{k}}^{2}\epsilon ^{2}\;\\&= {} \left[ 6dK_{\Gamma _{0}}^{2}K_{\Sigma _{0}}^{3}z_{\hat{n}_{k}}^{2}\epsilon \right] 2^{-4}\lambda _{1}\alpha , \end{aligned}$$

where the last line comes from setting \(\epsilon =\lambda _{1}\alpha /16\). To demonstrate dual feasibility, we therefore have to satisfy the further constraint that \(\epsilon \le 1/v_{{\mathcal {E}}}\) where \(v_{{\mathcal {E}}}:=6dK_{\Gamma _{0}}^{2}K_{\Sigma _{0}}^{3}z_{\hat{n}_{k}}^{2}\). Dual feasibility is obtained in the case where

$$\begin{aligned} \epsilon \in \left( 0,\min \left\{ \frac{1}{2^{3}5\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }},\frac{1}{6dz_{\hat{n}_{k}}^{2}\max \{K_{\Sigma _{0}}K_{\Gamma _{0}},K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}^{2}\}}\right\} \right) . \end{aligned}$$

Consider the lower bound \(\hat{n}_{k}>(\gamma _{\min }-2\delta _{T})T\rightarrow \gamma _{\min }T\); as \(T\rightarrow \infty \), the term \(z_{\hat{n}_{k}}\rightarrow 1+16\alpha ^{-1}\). For convergence of the tail bound (Lemma 4), we need

$$\begin{aligned} \epsilon ^{2}T>c_{4}\log (4p^{2})\big /(\gamma _{\min }-2\delta _{T}). \end{aligned}$$

Thus, we can choose a rate \(\epsilon =\Omega (T^{-1/2})\) and still maintain dual feasibility.

The final step of the proof is to demonstrate that all possible solutions to GFGL maintain this relation. In the case of GFGL, the following lemma states that the objective function is strictly convex on the positive-definite cone. Hence, if we find a minimum it is the global minimum, and the dual-feasibility condition ensures that the suggested bounds are achieved.

Lemma 7

(Strict Convexity) For matrices \(\Theta _{T}\in {\mathcal {S}}_{++}^{T}:=\{\{U^{(t)}\}_{t=1}^{T}\,|\,U^{(t)}\succ 0\,,\,U^{(t)}=U^{(t)\top }\}\) the GFGL cost function is strictly convex.

Appendix 3: Proof of Lemmata

3.1 Bounds for empirical covariance error

Proof

(Proof of Lemma 2) We want to bound \(P[\Vert V^{(k)}_{n}\Vert _{\infty }\ge \epsilon ]\). Our approach is to extend the single-block tail bound of Ravikumar et al. (2011), derived in an i.i.d. setting, to the case where samples are independent, but not necessarily identically sampled. Without loss of generality, consider the event

$$\begin{aligned} {\mathbb {A}}_{ij}(d):=\left\{ \bigg |\sum _{t=1}^{n}\big (X_{i}^{(t)}X_{j}^{(t)}-\Sigma _{0;ij}^{(t)}\big )\bigg |>d\right\} . \end{aligned}$$
(29)

Recall that the deviation of a sub-exponential random variable Z with \(E[Z]=\mu _{Z}\) is given by the inequality

$$\begin{aligned} P(Z\ge \mu _{Z}+v)\le {\left\{ \begin{array}{ll} \exp (\frac{-v^{2}}{2\gamma ^{2}}) &{}\quad {\mathrm {if}}\;0\le v\le \gamma ^{2}/b\\ \exp (-\frac{v}{2b}) &{}\quad {\mathrm {if}}\;v>\gamma ^{2}/b \end{array}\right. }. \end{aligned}$$
(30)

For independent \(Z^{(t)}\), sub-exponential with parameters \((\gamma _{t}^{2},b_{t})\), the sum \(Z_{ij}=Z_{ij}^{(1)}+\cdots +Z_{ij}^{(n)}\) is sub-exponential with parameters \((\sum _{t}\gamma _{t}^{2},\max _{t}b_{t})\). Now consider the event \({\mathbb {A}}_{ij}(d)\), and construct the auxiliary variables \(A_{ij}^{(t)}:=\check{X}_{i}^{(t)}+\check{X}_{j}^{(t)}\) and \(B_{ij}^{(t)}:=\check{X}_{i}^{(t)}-\check{X}_{j}^{(t)}\), where \(\check{X}_{i}^{(t)}=X_{i}^{(t)}/\sqrt{\Sigma _{0;ii}^{(t)}}\). We note that Lemma 9 of Ravikumar et al. (2011) holds at the individual time-step level such that, if \(\check{X}_{i}^{(t)}\) is sub-Gaussian with parameter \(\sigma _{t}\), then the random variables \(A_{ij}^{(t)}\) and \(B_{ij}^{(t)}\) are sub-Gaussian with parameter \(2\sigma _{t}\), and for all \(d>0\)

$$\begin{aligned} P[{\mathbb {A}}_{ij}(d)]&\le {} P\left[ \bigg |\sum _{t=1}^{n}\big \{(A_{ij}^{(t)})^{2}-2(1+\rho _{ij}^{(t)})\big \}\bigg |>2d\big /\sqrt{\Sigma _{0;ii}^{(t)}\Sigma _{0;jj}^{(t)}}\right] \\&\quad +\,P\left[ \bigg |\sum _{t=1}^{n}\big \{(B_{ij}^{(t)})^{2}-2(1-\rho _{ij}^{(t)})\big \}\bigg |>2d\big /\sqrt{\Sigma _{0;ii}^{(t)}\Sigma _{0;jj}^{(t)}}\right] . \end{aligned}$$

Through Lemma 10 of Ravikumar et al. (2011), we have that \(Z_{ij}^{(t)}:=(A_{ij}^{(t)})^{2}-2(1+\rho _{ij}^{(t)})\) is sub-exponential with parameters \(\gamma =b=16(1+4\sigma _{t}^{2})\). Application of (30) in the regime of the tighter bound (\(0\le v\le \gamma ^{2}/b\)) gives us

$$\begin{aligned} P\left[ \bigg |\sum _{t=1}^{n}\big \{(A_{ij}^{(t)})^{2}-2(1+\rho _{ij}^{(t)})\big \}\bigg |>2d\big /\sqrt{\Sigma _{0;ii}^{(t)}\Sigma _{0;jj}^{(t)}}\right] \le 2\exp \left\{ -d^{2}/nc_{1}\right\} , \end{aligned}$$

where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all \(d\le nc_{2}\), where \(c_{2}=2^{3}\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}(1+4\min _{t}\{\sigma _{t}^{2}\})^{2}/(1+4\max _{t}\{\sigma _{t}^{2}\})\). A bound on the required quantity \(P(\Vert V^{(k)}_{n}\Vert _{\infty }\ge \epsilon )\) follows from application of the union bound over both A and B, and then further over the individual elements of the \(p\times p\) matrices. In a general setting, we obtain

$$\begin{aligned} P\left[ \bigg \Vert \sum _{t=1}^{n}\big (X^{(t)}(X^{(t)})^{\top }-\Sigma ^{(t)}_{0}\big )\bigg \Vert _{\infty }>d\right] \le 4p^{2}\exp \left\{ -d^{2}/nc_{1}\right\} , \end{aligned}$$
(31)

where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all

$$\begin{aligned} d\le 2^{3}n\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}(1+4\min _{t}\{\sigma _{t}^{2}\})^{2}/(1+4\max _{t}\{\sigma _{t}^{2}\}). \end{aligned}$$

The required bound is given by the inequality \(\Vert X\Vert _{F}\le p\Vert X\Vert _{\infty }\) and setting \(d=\epsilon n\) with \(\epsilon \le c_{2}\). \(\square \)

Proof

(Proof of Lemma 4) For the quantity \(\Vert \tilde{V}^{(k)}\Vert _{\infty }\), we adapt the proof of Lemma 2. A bound can be derived from (31) by setting \(n=\hat{n}_{k}\) and \(d=\epsilon \hat{n}_{k}\), letting \(\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}=\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\) and \(\max _{t}\{\sigma _{t}^{2}\}=\min _{t}\{\sigma _{t}^{2}\}=1\), i.e. all variates are Gaussian. We then obtain

$$\begin{aligned} P[\Vert \tilde{V}^{(k)}\Vert _{\infty }>\epsilon ]\le 4p^{2}\exp \{-\epsilon ^{2}\hat{n}_{k}/c_{4} \}, \end{aligned}$$

for \(c_{4}=2^{7}5^{2}\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\) and \(\epsilon <c_{5}:=2^{3}5\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\). \(\square \)

3.2 Dual-feasibility with mis-specification (Proof of Lemma 3)

Proof

We can write the block-wise optimality conditions (24) for the restricted estimator as

$$\begin{aligned}&(\Theta _{0}^{(k)})^{-1}\Delta ^{(k)}(\Theta _{0}^{(k)})^{-1}-{\mathcal {E}}(\Delta ^{(k)})+\frac{1}{\hat{n}_{k}}\left( \sum _{l\ne k}^{\hat{B}}\hat{n}_{lk}V^{(l)}_{\hat{n}_{lk}}+\hat{n}_{kk}V^{(k)}_{\hat{n}_{kk}}\right) +\lambda _{1}\bar{R}_{1}^{(k)}\\&\quad +\,\frac{\lambda _{2}}{\hat{n}_{k}}(\bar{R}_{2}^{(\hat{\tau }_{k-1})}-\bar{R}_{2}^{(\hat{\tau }_{k})})=0. \end{aligned}$$

As pointed out in Ravikumar et al. (2011), this equation may be written as an ordinary linear equation by vectorising the matrices, for instance \({\mathrm {vec}}\{(\Theta _{0}^{(k)})^{-1}\Delta ^{(k)}(\Theta _{0}^{(k)})^{-1}\}=\{(\Theta _{0}^{(k)})^{-1}\otimes (\Theta _{0}^{(k)})^{-1}\}{\mathrm {vec}}(\Delta ^{(k)})\equiv \Gamma _{0}{\mathrm {vec}}(\Delta )\). Utilising the fact \(\Delta _{{\mathcal {M}}^{\perp }}=0\) we can split the optimality conditions into two blocks of linear equations

$$\begin{aligned} \Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)}{\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})+{\mathrm {vec}}(G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}})&=0\;\end{aligned}$$
(32)
$$\begin{aligned} \Gamma _{0;{\mathcal {M}}^{\perp }{\mathcal {M}}}^{(k)}{\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})+{\mathrm {vec}}(G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}^{\perp }})&=0, \end{aligned}$$
(33)

where

$$\begin{aligned} G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2}):=\tilde{V}^{(k)}-{\mathcal {E}}(\Delta ^{(k)})+\lambda _{1}\bar{R}_{1}^{(k)}+\hat{n}_{k}^{-1}\lambda _{2}(\bar{R}_{2}^{(\hat{\tau }_{k-1})}-\bar{R}_{2}^{(\hat{\tau }_{k})}). \end{aligned}$$

Solving (32) for \({\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})\) we find \({\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})=-(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)})^{-1}{\mathrm {vec}}\{G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}}\}\). Substituting this into (33) and re-arranging for \(\bar{R}_{1;{\mathcal {M}}^{\perp }}\) gives

$$\begin{aligned} {\mathrm {vec}}(G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}^{\perp }})=\Gamma _{0;{\mathcal {M}}^{\perp }{\mathcal {M}}}^{(k)}(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)})^{-1}{\mathrm {vec}}\{G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}}\}, \end{aligned}$$

and thus, letting \(H^{(k)}:=\Gamma _{0;{\mathcal {M}}^{\perp }{\mathcal {M}}}^{(k)}(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)})^{-1}\) we obtain

$$\begin{aligned} \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}&=\frac{1}{\lambda _{1}}H^{(k)}{\mathrm {vec}}\{\tilde{V}^{(k)}_{{\mathcal {M}}}-{\mathcal {E}}_{{\mathcal {M}}}(\Delta ^{(k)})\}+\frac{\lambda _{2}}{\hat{n}_{k}\lambda _{1}}H^{(k)}{\mathrm {vec}}\{(\bar{R}_{2}^{(\hat{\tau }_{k-1})}-\bar{R}_{2}^{(\hat{\tau }_{k})})_{{\mathcal {M}}}\}+H^{(k)}{\mathrm {vec}}(\bar{R}_{1;{\mathcal {M}}}^{(k)})\\&\quad -\frac{1}{\lambda _{1}}{\mathrm {vec}}\{\tilde{V}^{(k)}_{{\mathcal {M}}^{\perp }}-{\mathcal {E}}_{{\mathcal {M}}^{\perp }}(\Delta ^{(k)})\}-\frac{\lambda _{2}}{\hat{n}_{k}\lambda _{1}}{\mathrm {vec}}\{(\bar{R}_{2}^{(\hat{\tau }_{k-1})}-\bar{R}_{2}^{(\hat{\tau }_{k})})_{{\mathcal {M}}^{\perp }}\}. \end{aligned}$$

Taking the \(\ell _{\infty }\) norm of both sides gives

$$\begin{aligned} \Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }&\le \frac{1}{\lambda _{1}}{\mathopen {|||}} H^{(k)} {\mathclose {|||}} _{\infty }(\Vert \tilde{V}^{(k)}_{{\mathcal {M}}}\Vert _{\infty }+\Vert {\mathcal {E}}_{{\mathcal {M}}}(\Delta ^{(k)})\Vert _{\infty })+\Vert H^{(k)}{\mathrm {vec}}(\bar{R}_{1;{\mathcal {M}}}^{(k)})\Vert _{\infty }\\&\quad +\frac{1}{\lambda _{1}}(\Vert \tilde{V}^{(k)}_{{\mathcal {M}}^{\perp }}\Vert _{\infty }+\Vert {\mathcal {E}}_{{\mathcal {M}}^{\perp }}(\Delta ^{(k)})\Vert _{\infty })\\&\quad +\frac{\lambda _{2}}{\hat{n}_{k}\lambda _{1}}\left\{ {\mathopen {|||}} H^{(k)} {\mathclose {|||}} _{\infty }(\Vert \bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k})}\Vert _{\infty })+\Vert \bar{R}_{2;{\mathcal {M}}^{\perp }}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2;{\mathcal {M}}^{\perp }}^{(\hat{\tau }_{k})}\Vert _{\infty }\right\} . \end{aligned}$$

\(\square \)
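The vectorisation identity used in the proof above can be verified numerically; in the sketch below (with random, hypothetical matrices), row-major flattening agrees with the Kronecker form because \((\Theta _{0}^{(k)})^{-1}\) is symmetric.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
A = rng.normal(size=(p, p))
Theta0 = A @ A.T + p * np.eye(p)     # random positive-definite precision matrix
Delta = rng.normal(size=(p, p))
Delta = (Delta + Delta.T) / 2        # symmetric perturbation

Sigma0 = np.linalg.inv(Theta0)
Gamma0 = np.kron(Sigma0, Sigma0)     # Gamma_0 = Theta0^{-1} (x) Theta0^{-1}

lhs = (Sigma0 @ Delta @ Sigma0).flatten()   # vec{Theta0^{-1} Delta Theta0^{-1}}
rhs = Gamma0 @ Delta.flatten()              # Gamma_0 vec(Delta)
print("max abs difference:", np.abs(lhs - rhs).max())   # agrees to machine precision
```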

Claim

The error in the model-space dominates that outside such that

$$\begin{aligned} \Vert \tilde{V}^{(k)}_{{\mathcal {M}}^{\perp }}\Vert _{\infty }&\le \Vert \tilde{V}^{(k)}_{{\mathcal {M}}}\Vert _{\infty },\end{aligned}$$
(34)
$$\begin{aligned} \Vert {\mathcal {E}}_{{\mathcal {M}}^{\perp }}(\Delta ^{(k)})\Vert _{\infty }&\le \Vert {\mathcal {E}}_{{\mathcal {M}}}(\Delta ^{(k)})\Vert _{\infty }. \end{aligned}$$
(35)

Furthermore, the maximum size of the sub-gradient in the model subspace is bounded \(\Vert \bar{R}_{1;{\mathcal {M}}}^{(k)}\Vert _{\infty }\le 1\).

A similar claim is made in Ravikumar et al. (2011), and thus, via the results above, we obtain \(\Vert H^{(k)}{\mathrm {vec}}(\bar{R}_{1;{\mathcal {M}}}^{(k)})\Vert _{\infty }\le 1-\alpha \) and

$$\begin{aligned} \Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }&\le \lambda _{1}^{-1}(2-\alpha )\{\Vert \tilde{V}^{(k)}_{{\mathcal {M}}}\Vert _{\infty }+\Vert {\mathcal {E}}_{{\mathcal {M}}}(\Delta ^{(k)})\Vert _{\infty }\\&\quad +\lambda _{2}\hat{n}_{k}^{-1}(\Vert \bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k})}\Vert _{\infty })\}+\Vert H^{(k)}{\mathrm {vec}}(\bar{R}_{1;{\mathcal {M}}}^{(k)})\Vert _{\infty }. \end{aligned}$$

The condition (26) stated in the lemma now ensures \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }<1\).

3.3 Control of estimation error (Proof of Lemma 6)

Note that \(\bar{\Theta }^{(k)}_{{\mathcal {M}}^{\perp }}=\Theta ^{(k)}_{0;{\mathcal {M}}^{\perp }}=0\), and thus \(\Vert \Delta ^{(k)}\Vert _{\infty }=\Vert \Delta ^{(k)}_{{\mathcal {M}}}\Vert _{\infty }\). Our proof follows the spirit of Lemma 6 from Ravikumar et al. (2011). The first step is to characterise the solution \(\bar{\Theta }_{{\mathcal {M}}}\) in terms of the zero-gradient condition of the restricted oracle problem. Define a function to represent the block-wise optimality conditions [akin to Eq. (75) of Ravikumar et al. (2011)]:

$$\begin{aligned} Q(\Theta _{{\mathcal {M}}}^{(k)})=-(\Theta _{{\mathcal {M}}}^{(k)})^{-1}+\hat{n}_{k}^{-1}\left( \sum _{t=\hat{\tau }_{k-1}}^{\hat{\tau }_{k}-1}\hat{S}_{{\mathcal {M}}}^{(t)}\right) +\lambda _{1}\bar{R}_{1}^{(k)}+\lambda _{2}\hat{n}_{k}^{-1}(\bar{R}_{2}^{(\hat{\tau }_{k-1})}-\bar{R}_{2}^{(\hat{\tau }_{k})})=0. \end{aligned}$$

Now construct a map \(F:\Delta ^{(k)}_{{\mathcal {M}}}\mapsto F(\Delta ^{(k)}_{{\mathcal {M}}})\) such that its fixed points are equivalent to the zeros of the gradient expression in terms of \(\Delta ^{(k)}_{{\mathcal {M}}}\). To simplify the analysis, let us work with the vectorised form and define the map

$$\begin{aligned} F({\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})):=-(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}})^{-1}{\mathrm {vec}}\{Q(\Theta _{{\mathcal {M}}}^{(k)})\}+{\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}}), \end{aligned}$$

such that \(F\{{\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})\}={\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})\) iff \(Q(\Theta _{0;{\mathcal {M}}}^{(k)}+\Delta ^{(k)}_{{\mathcal {M}}})=Q(\Theta _{{\mathcal {M}}}^{(k)})=0\). Now, to ensure that all solutions satisfying the zero-gradient expression have their error bounded within the ball, we demonstrate that F maps an \(\ell _{\infty }\) ball \({\mathbb {B}}(r):=\{\Theta ^{(k)}_{\mathcal {M}}|\,\Vert \Theta ^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\}\) onto itself. Expanding \(F({\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}))\), we find

$$\begin{aligned} F({\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}))&=-(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}})^{-1}{\mathrm {vec}}\{Q(\Theta _{0;{\mathcal {M}}}^{(k)}+\Delta ^{(k)}_{{\mathcal {M}}})\}+{\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})\\&=T_{1}-T_{2}, \end{aligned}$$

where

$$\begin{aligned} T_{1}&:=(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}})^{-1}{\mathrm {vec}}\big [\{(\Theta _{0}^{(k)}+\Delta ^{(k)})^{-1}-(\Theta _{0}^{(k)})^{-1}\}_{\mathcal {M}}\big ]+{\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}),\\ T_{2}&:=(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}})^{-1}{\mathrm {vec}}\big [\tilde{V}^{(k)}_{\mathcal {M}}+\lambda _{1}\bar{R}_{1;{\mathcal {M}}}^{(k)}+\lambda _{2}\hat{n}_{k}^{-1}(\bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k-1})}-\bar{R}_{2;{\mathcal {M}}}^{(\hat{\tau }_{k})})\big ]. \end{aligned}$$

The rest of the proof follows from Ravikumar et al. (2011), where, via Lemma 5, one can show

$$\begin{aligned} \Vert T_{1}\Vert _{\infty }\le \frac{3}{2}dK_{\Sigma _{0}}^{3}K_{\Gamma _{0}}\Vert \Delta ^{(k)}\Vert _{\infty }^{2}\le \frac{3}{2}dK_{\Sigma _{0}}^{3}K_{\Gamma _{0}}r^{2}, \end{aligned}$$

under the assumptions of the lemma, we obtain \(\Vert T_{1}\Vert _{\infty }\le r/2\). Combined with the stated form of r, we also find \(\Vert T_{2}\Vert _{\infty }\le r/2\), and thus \(\Vert F({\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}))\Vert _{\infty }\le r\). Through the construction of F, we have \(\Vert \Delta ^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\) iff \(Q(\Theta _{0;{\mathcal {M}}}^{(k)}+\Delta ^{(k)}_{\mathcal {M}})=Q(\Theta _{\mathcal {M}}^{(k)})=0\), and since \(Q(\bar{\Theta }_{\mathcal {M}}^{(k)})=0\) for any \(\bar{\Theta }_{\mathcal {M}}^{(k)}\), we obtain \(\Vert \bar{\Delta }^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\), where \(\bar{\Delta }^{(k)}_{\mathcal {M}}:=\bar{\Theta }^{(k)}-\Theta _{0}^{(k)}\). Finally, the existence of a solution \(\bar{\Theta }_{\mathcal {M}}^{(k)}\) corresponding to \({\mathrm {vec}}(\bar{\Delta }^{(k)}_{\mathcal {M}})\in {\mathbb {B}}(r)\) is guaranteed by Brouwer’s fixed point theorem. \(\square \)

Appendix 4: Further results

4.1 Strict convexity of GFGL (Proof of Lemma 7)

Proof

The negative log-det barrier \(-\log \det (\Theta ^{(t)})\) is strictly convex on \(\Theta ^{(t)}\in {\mathcal {S}}_{++}^{1}\). While the Frobenius norm \(\Vert A\Vert _{F}\) is strictly convex on a given matrix \(A\in {\mathcal {S}}_{++}^{1}\), it is not strictly convex when considering the mixed norm \(\sum _{t=2}^{T}\Vert \Theta ^{(t)}-\Theta ^{(t-1)}\Vert _{F}\) for \(\Theta ^{(t)}\in {\mathcal {S}}_{++}^{(T)}\). However, due to Lagrangian duality, we can re-write the GFGL problem as an explicitly constrained problem

$$\begin{aligned}&\min _{\Theta _{T}\in {\mathcal {S}}_{++}^{(T)}}\big \{\sum _{t=1}^{T}\big [\langle \Theta ^{(t)},\hat{S}^{(t)}\rangle -\log \det (\Theta ^{(t)})\big ]\big \} {\text{ such that }} \\&\sum _{t=1}^{T}\Vert \Theta _{-ii}^{(t)}\Vert _{1}+\frac{\lambda _{2}}{\lambda _{1}}\sum _{t=2}^{T}\Vert \Theta ^{(t)}-\Theta ^{(t-1)}\Vert _{F}\le C(\lambda _{1}). \end{aligned}$$

We can alternatively write

$$\begin{aligned}&\min _{\Theta _{T}\in {\mathcal {S}}_{++}^{(T)}}\big \{\sum _{t=1}^{T}\big [\langle \Theta ^{(t)},\hat{S}^{(t)}\rangle -\log \det (\Theta ^{(t)})\big ]\big \}\\ {\mathrm {such\,that}}&\quad \sum _{t=1}^{T}\Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{\mathrm {sparse}}(\lambda _{1},\lambda _{2})\quad {\mathrm {and}}\quad \sum _{t=2}^{T}\Vert \Theta ^{(t)}-\Theta ^{(t-1)}\Vert _{F}\le C_{\mathrm {smooth}}(\lambda _{1},\lambda _{2}). \end{aligned}$$

A similar argument to that used in Ravikumar et al. (2011) now holds. Specifically, we note that even the rank-one estimate \(\hat{S}^{(t)}\) will have positive diagonal entries \(\hat{S}_{ii}^{(t)}>0\) for all \(i=1,\ldots ,p\). The off-diagonal entries in the precision matrix are restricted through the \(\ell _{1}\) term. Unlike in the standard static case, the size of this norm is not related to just a single precision matrix; rather, it counts the size of the off-diagonals over the whole set \(\{\Theta ^{(t)}\}_{t=1}^{T}\). Thus, to obtain strict convexity, one also needs to include appropriate smoothing. To borrow the argument of Ravikumar et al. (2011), we need to demonstrate that for any time-point t we can construct a problem of the form \(\min _{\Theta ^{(t)}\in {\mathcal {S}}_{++}^{(1)}}\big \{\langle \Theta ^{(t)},\hat{S}^{(t)}\rangle -\log \det (\Theta ^{(t)})\big \}\) such that \(\Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{t}(\lambda _{1},\lambda _{2})\). The constraint due to smoothing allows exactly this; for instance, one may obtain a bound \(\Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{\mathrm {sparse}}(\lambda _{1},\lambda _{2})-\sum _{s\ne t}\Vert \Theta _{-ii}^{(s)}\Vert _{1}\). Writing \(\Theta ^{(s)}=\Theta ^{(1)}+\sum _{q=2}^{s}(\Theta ^{(q)}-\Theta ^{(q-1)})\) for \(s\ge 2\), we obtain

$$\begin{aligned} \sum _{s\ne t}\Vert \Theta ^{(s)}\Vert _{1}&\le \Vert \Theta ^{(1)}\Vert _{1}+\sum _{s\ne t}\sum _{q=2}^{s}\Vert \Theta ^{(q)}-\Theta ^{(q-1)}\Vert _{1}\\&\le pC_{\mathrm {smooth}}(\lambda _{1},\lambda _{2})+\Vert \Theta ^{(1)}\Vert _{1}, \end{aligned}$$

where we note \(\Vert \cdot \Vert _{1}\le p\Vert \cdot \Vert _{F}\). Converting to the bound for the \(\ell _{1}\) norm at time t we find

$$\begin{aligned} \Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{\mathrm {sparse}}(\lambda _{1},\lambda _{2})-pC_{\mathrm {smooth}}(\lambda _{1},\lambda _{2})-\Vert \Theta ^{(1)}\Vert _{1}\equiv C_{t}(\lambda _{1},\lambda _{2}), \end{aligned}$$

and thus an effective bound on the time-specific \(\ell _{1}\) norm can be obtained. \(\square \)
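To make the object of Lemma 7 concrete, the following is a minimal sketch of the penalised GFGL cost over a sequence of precision matrices, assuming the form with an \(\ell _{1}\) penalty on off-diagonals and a Frobenius fusion penalty; the data and regulariser values are placeholders.

```python
import numpy as np

def gfgl_cost(Thetas, S_hats, lam1, lam2):
    """Penalised GFGL objective for precision matrices Thetas[t], t = 0..T-1."""
    loss = sum(np.sum(Theta * S) - np.linalg.slogdet(Theta)[1]
               for Theta, S in zip(Thetas, S_hats))            # <Theta, S> - log det Theta
    off = sum(np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
              for Theta in Thetas)                             # l1 penalty on off-diagonals
    fuse = sum(np.linalg.norm(Thetas[t] - Thetas[t - 1], "fro")
               for t in range(1, len(Thetas)))                 # block-fusion penalty
    return loss + lam1 * off + lam2 * fuse

# Toy usage with T = 3 time points in p = 2 dimensions (hypothetical data).
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 2))
S_hats = [np.outer(x, x) for x in X]       # rank-one per-time covariances \hat{S}^{(t)}
Thetas = [np.eye(2) for _ in range(3)]
print(gfgl_cost(Thetas, S_hats, lam1=0.1, lam2=1.0))
```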

4.2 Consistency when \(\hat{K}\ge K\) (Proof Sketch for Proposition 1)

Proof

What follows is a sketch of the proof of Proposition 1. Let us start by recalling the assumption \(K\le \hat{K}\le K_{\max }\). We can split the bound in two: one part corresponding to the correctly identified number of changes, and one for those which exceed this number:

$$\begin{aligned}&P\left[ \big \{ h(\hat{\mathcal {T}}_{\hat{K}}\Vert {\mathcal {T}}_{K})\ge T\delta _{T}\big \}\cap \big \{ K\le \hat{K}\le K_{\max }\big \}\right] \le \underset{\rightarrow 0\;{\mathrm {via}}\;{\mathrm {Thm.1}}}{\underbrace{P[h(\hat{{\mathcal {T}}}_{K}\Vert {\mathcal {T}}_{K})\ge T\delta _{T}]}} \\&\quad +\,\sum _{L> K}^{K_{\max }}P[h(\hat{\mathcal {T}}_{L}\Vert {\mathcal {T}}_{K})\ge T\delta _{T}]. \end{aligned}$$

Focussing on the second bound, Harchaoui and Lévy-Leduc (2010) demonstrate that this can be broken down in terms of three events:

$$\begin{aligned} \sum _{L>K}^{K_{\max }}P[h(\hat{\mathcal {T}}_{L}\Vert {\mathcal {T}}_{K})\ge T\delta _{T}]\le \sum _{L>K}^{K_{\max }}\sum _{k=1}^{K}P[F_{T,k,1}]+P[F_{T,k,2}]+P[F_{T,k,3}], \end{aligned}$$

where

$$\begin{aligned} F_{T,k,1}&= {} \{|\hat{\tau }_{l}-\tau _{k}|\ge T\delta _{T}\,{\mathrm {and}}\,\hat{\tau }_{l}<\tau _{k},\,\forall 1\le l\le L\}\\ F_{T,k,2}&= {} \{|\hat{\tau }_{l}-\tau _{k}|\ge T\delta _{T}\,{\mathrm {and}}\,\hat{\tau }_{l}>\tau _{k},\,\forall 1\le l\le L\}\\ F_{T,k,3}&= {} \{\exists 1\le l\le L\,|\,|\hat{\tau }_{l}-\tau _{k}|\ge T\delta _{T},\,|\hat{\tau }_{l+1}-\tau _{k}|\ge T\delta _{T},\,{\mathrm {and}}\,\hat{\tau }_{l}<\tau _{k}<\hat{\tau }_{l+1}\}. \end{aligned}$$

The probabilities of these events can all be bounded in a similar way to Eq. (14), i.e. by considering specific start and end points in the optimality conditions given by Lemma 1. In the interests of space, we refer the reader to Harchaoui and Lévy-Leduc (2010) for the technical details of this procedure.

Finally, in the statement of the above proposition, we asserted that there was some \(K_{\max }\) which upper bounded the number of estimated changepoints. By considering the bias of the GFGL estimator as a function of specific \(\lambda _1,\lambda _2\), which scale as discussed in the main paper, one should be able to demonstrate that the estimated number of changepoints is indeed bounded from above. In the seminal paper of Harchaoui and Lévy-Leduc (2010), this is achieved via Lemma 2 of Meinshausen and Yu (2009). In the case of GFGL, with the log-determinant constrained likelihood and the additional regulariser, a new result would be needed to upper bound \(\hat{K}\). One possible pathway to achieving this would be via the analysis provided in Rothman et al. (2008), extended to the non-stationary setting. A proof of such a specific result is left as future work. \(\square \)

About this article

Cite this article

Gibberd, A., Roy, S. Consistent multiple changepoint estimation with fused Gaussian graphical models. Annals of the Institute of Statistical Mathematics, 73, 283–309 (2021). https://doi.org/10.1007/s10463-020-00749-0