Abstract
We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time-series. Traditionally, estimation of Gaussian graphical models (GGMs) is performed in an i.i.d. setting. More recently, such models have been extended to allow for changes in the distribution, but primarily where changepoints are known a priori. In this work, we study the Group-Fused Graphical Lasso (GFGL), which penalises partial correlations with an \(\ell_1\) penalty while simultaneously inducing block-wise smoothness over time in order to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints and the structure of the graphical models in each segment. We contrast our results, which are based on a global (i.e. graph-wide) likelihood, with those previously obtained for performing dynamic graph estimation at a node-wise (or neighbourhood) level.
Notes
For a reference on static graphical models the reader is directed to Lauritzen (1996).
References
Angelosante, D., Giannakis, G. B. (2011). Sparse graphical modeling of piecewise-stationary time-series. In: International conference on acoustics, speech and signal processing (ICASSP) (pp. 1960–1963).
Bai, J. (1997). Estimation of a change point in multiple regression models. The Review of Economics and Statistics, 79(4), 551–563.
Banerjee, O., Ghaoui, L. E. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Cai, T., Liu, W., Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.
Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. Annals of Statistics, 42(6), 2243–2281.
Gibberd, A. J., Nelson, J. D. B. (2014). High dimensional changepoint detection with a dynamic graphical lasso. In: ICASSP, IEEE international conference on acoustics, speech and signal processing - proceedings (pp. 2684–2688).
Gibberd, A. J., Nelson, J. D. B. (2017). Regularized estimation of piecewise constant Gaussian graphical models : The group-fused graphical Lasso. Journal of Computational and Graphical Statistics, 26(3), 623.
Hallac, D., Park, Y., Boyd, S., Leskovec, J. (2017). Network inference via the time-varying graphical lasso. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD).
Harchaoui, Z., Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492), 1480–1493.
Hinkley, D. V. (1970). Inference about the change point in a sequence of random variables. Biometrika, 57(1), 1–17.
Killick, R., Fearnhead, P., Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590–1598.
Kolar, M., Xing, E. P. (2012). Estimating networks with jumps. Electronic Journal of Statistics, 6, 2069–2106.
Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.
Lee, S., Seo, M. H., Shin, Y. (2016). The lasso for high dimensional regression with a possible change point. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1), 193–210.
Leonardi, F., Bühlmann, P. (2016). Computationally efficient change point detection for high-dimensional regression. arXiv:1601.03704.
Meinshausen, N. (2008). A note on the Lasso for Gaussian graphical model selection. Statistics & Probability Letters, 78, 880–884.
Meinshausen, N., Bühlmann, P. (2006). High dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462.
Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.
Monti, R. P., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, S., Montana, G. (2014). Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103, 427–443.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Raimondo, M. (1998). Minimax estimation of sharp change points. Annals of Statistics, 26(4), 1379–1397.
Ravikumar, P., Wainwright, M. J., Raskutti, G., Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5, 935–980.
Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494–515.
Roy, S., Atchadé, Y., Michailidis, G. (2016). Change point estimation in high dimensional Markov random-field models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1187–1206.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
Appendices
Appendix 1: Proof of changepoint consistency
We reduce the task of bounding the maximum deviation between estimated and true changepoints to bounding the probability that an individual changepoint breaks the bound. Following Harchaoui and Lévy-Leduc (2010), we utilise the union bound
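In the notation of this appendix, with K true changepoints \(\tau_{k}\) and estimates \(\hat{\tau}_{k}\), this union bound takes the form:

```latex
P\Big[\max_{k\in[K]}|\tau_{k}-\hat{\tau}_{k}| > T\delta_{T}\Big]
\;\le\; \sum_{k=1}^{K} P\big[\,|\tau_{k}-\hat{\tau}_{k}| > T\delta_{T}\,\big].
```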
The complement of the event on the LHS is precisely the target of the proof: we wish to demonstrate \(P[\max _{k}|\tau _{k}-\hat{\tau }_{k}|\le T\delta _{T}]\rightarrow 1\). To show this, we need the LHS above to go to zero as \(T\rightarrow \infty \). Via the union bound, it is sufficient to demonstrate that the probabilities of the bad events:
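As stated explicitly later (Sect. 1.3), these are the events

```latex
A_{T,k} := \{\,|\tau_{k}-\hat{\tau}_{k}| > T\delta_{T}\,\}
```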
go to zero for all \(k\in [K]\). The strategy presented here separates the probability of \(A_{T,k}\) occurring across complementary events. In particular, let us construct what can be thought of as a good event, in which the estimated changepoints all lie within a region of the true ones:
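Matching the complement \(C_{T}^{c}\) given in Sect. 1.3, the good event is

```latex
C_{T} := \Big\{\max_{k\in[K]}|\hat{\tau}_{k}-\tau_{k}| < d_{\min}/2\Big\}.
```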
The task is then to show that \(P[A_{T,k}]\rightarrow 0\) by showing \(P[A_{T,k}\cap C_{T}]\rightarrow 0\) and \(P[A_{T,k}\cap C_{T}^{c}]\rightarrow 0\) as \(T\rightarrow \infty \).
1.1 Stationarity-induced bounds
As a first step, let us introduce some bounds, based on the optimality conditions, which hold with probability one. We base our results on a set of events defined in relation to these conditions. By intersecting these events with \(A_{T,k}\cap C_{T}\) and \(A_{T,k}\cap C_{T}^{c}\), we can construct an upper bound on the probability of the changepoint error exceeding a level \(T\delta _{T}\).
Without loss of generality, consider the optimality equations (Lemma 1) with changepoints \(l=\tau _{k}\) and \(l=\hat{\tau }_{k}\) such that \(\hat{\tau }_{k}<\tau _{k}\). We note that an argument for the reverse situation \(\tau _{k}>\hat{\tau }_{k}\) follows by symmetry. Taking the difference between the equations, we find
The gradient from the \(\ell _{1}\) term \(\sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}\lambda _{1}\hat{R}_{1}^{(t)}\) can attain a maximum value of \(\pm \lambda _{1}(\tau _{k}-\hat{\tau }_{k})\) for each entry in the precision matrix. Transferring this to the RHS and splitting the LHS into stochastic and estimated terms, we obtain
The next step is to replace the time-indexed inverse precision matrices \((\Theta ^{(t)})^{-1}\) with the block covariance matrices \(\Sigma _0^{(k)}\) and \(\Sigma _0^{(k+1)}\). We can re-express the difference in these matrices as the sum of the difference between the true values either side of \(\tau _{k}\), i.e. \(\Sigma _0^{(k+1)}-\Sigma _0^{(k)}\), and the difference between the (\(k+1\))st true block and its estimate, i.e. \(\hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)}\), to obtain:
which holds with probability one. Define the events:
Since the bound (11) holds with probability one, the union of these three events must also occur with probability one, i.e. \(P[E_{1}\cup E_{2}\cup E_{3}]=1\).
1.2 Bounding the good cases
At least one of the three events above must occur, whether together or separately. We can thus use them to bound the probability that the good event \(C_{T}\) and the bad event \(A_{T,k}\) occur simultaneously. Similarly to Harchaoui and Lévy-Leduc (2010) and Kolar and Xing (2012), we obtain
The following sub-sections describe how to separately bound these sub-events.
Unlike in the work of Kolar and Xing (2012), there is no stochastic element (related to the data \(X_{t}\)) within the first event \(A_{T,k,1}\). We can bound \(P[A_{T,k,1}]\) by considering the event \(\{\frac{1}{3}\Vert R_{1}\Vert _{F}\le \lambda _{2}+\lambda _{1}\sqrt{p(p-1)}(\tau _{k}-\hat{\tau }_{k})\}\). Given that \(\Vert R_{1}\Vert _{F}=\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}(\Sigma _0^{(k)}-\Sigma _0^{(k+1)})\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\), we obtain the bound
When the events \(C_{T}\) and \(A_{T,k}\) occur, we have \(T\delta _{T}<\tau _{k}-\hat{\tau }_{k}\le d_{\min }/2\). To ensure the event \(A_{T,k,1}\) does not occur, we need:
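To sketch the form of these conditions (the exact constants are those of Eq. 12): on \(A_{T,k,1}\) we have \((\tau_{k}-\hat{\tau}_{k})\eta_{\min}/3\le \lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\tau_{k}-\hat{\tau}_{k})\), so dividing through by \(\tau_{k}-\hat{\tau}_{k}>T\delta_{T}\), the event is excluded whenever

```latex
\frac{\eta_{\min}}{3} \;>\; \frac{\lambda_{2}}{T\delta_{T}} \;+\; \lambda_{1}\sqrt{p(p-1)}.
```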
These conditions are satisfied by Assumption 1. Thus, for a large enough T, we can show that \(P[A_{T,k,1}]=0\); the requisite size of T depends on the quantities in Eq. (12).
Now let us consider the event \(A_{T,k,2}\). Consider the quantity \(\bar{\tau }_{k}:=\lfloor (\tau _{k}+\tau _{k+1})/2\rfloor \). On the event \(C_{T}\), we have \(\hat{\tau }_{k+1}>\bar{\tau }_{k}\), so \(\hat{\Sigma }^{(t)}=\hat{\Sigma }^{(k+1)}\) for all \(t\in [\tau _{k},\bar{\tau }_{k}]\). Using the optimality conditions (Proposition 1) with changepoints at \(l=\bar{\tau }_{k}\) and \(l=\tau _{k}\), we obtain
and thus
We now combine the bounds for events \(E_{1}\) and \(E_{2}\), via \(E_{2}:=\{\Vert R_{2}\Vert _{F}\ge \frac{1}{3}\Vert R_{1}\Vert _{F}\}\) and the bounds \(\Vert R_{1}\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\) and \(\Vert R_{2}\Vert _{F}\le (\tau _{k}-\hat{\tau }_{k})\Vert \hat{\Sigma }^{(k+1)}-\Sigma _0^{(k+1)}\Vert _{F}\). Substituting in (13), we have
Splitting the probability into three components, we obtain
Convergence of the first term follows as for \(A_{T,k,1}\), and the second is covered exactly by the argument for \(A_{T,k,1}\); however, the third term \(\eta _{\min }\le 3\Vert \sum _{t=\tau _{k}}^{\bar{\tau }_{k}-1}W^{(t)}\Vert _{F}/(\bar{\tau }_{k}-\tau _{k})\) requires some extra treatment. As \(\bar{\tau }_{k}<\tau _{k+1}\), we can relate the ground-truth (time-indexed) covariance matrices to their block (indexed by k) counterparts such that \(\Sigma ^{(t)}=\Sigma _0^{(k)}\) for all \(t\in [\tau _{k},\tau _{k+1}]\). One can now write the average sampling error across time according to:
where
and \(s^{(k)}\subseteq [\tau _k,\tau _{k+1}]\) is a subset of the kth changepoint interval. To simplify the notation, we will refer to the above quantities via a general subset of data of size \(n=|s^{(k)}|\); principally, this is because within a block the samples are i.i.d., so only the length n distinguishes the quantity.
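In this notation, hedging on the exact display in the paper, the average sampling error and per-sample noise terms may be written as

```latex
V^{(k)}_{n} \;:=\; \frac{1}{n}\sum_{t\in s^{(k)}} W^{(t)},
\qquad
W^{(t)} \;:=\; X^{(t)}\big(X^{(t)}\big)^{\top} - \Sigma_{0}^{(k)} .
```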
Lemma 2
(Error Bound on Empirical Covariance Matrix)
Let \(X^{(t)}\sim {\mathcal {N}}(0,\Sigma _{0}^{(k)})\), or more generally be sub-Gaussian with parameter \(\sigma _{t}\) for \(t=1,\ldots ,n\), then
where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all \(\epsilon \le c_{2}\), with \(c_{2}\) as defined in the proof (Appendix 3.1).
The result is a corollary of relating the empirical covariance under sub-Gaussian sampling to the sum of sub-exponential random variables; see Appendix 3.1 for details.
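As an informal numerical illustration of the concentration behaviour behind Lemma 2 (this sketch is not part of the proof; the function name and parameter choices are ours), the following Python snippet draws i.i.d. Gaussian samples and checks that the elementwise covariance error shrinks at roughly the \(n^{-1/2}\) rate implied by the exponential tail bound:

```python
import numpy as np

def max_cov_error(n, p, rng):
    """Elementwise max deviation of the empirical (zero-mean) covariance
    of n i.i.d. N(0, I_p) samples from its ground truth I_p."""
    X = rng.standard_normal((n, p))
    S_hat = X.T @ X / n
    return np.max(np.abs(S_hat - np.eye(p)))

rng = np.random.default_rng(0)
p, reps = 10, 200
# Average the error over Monte Carlo repetitions for several sample sizes.
errs = {n: np.mean([max_cov_error(n, p, rng) for _ in range(reps)])
        for n in (100, 400, 1600)}
# Quadrupling n should roughly halve the average error (an n^{-1/2} rate).
```

The decay of `errs` with n mirrors the \(\exp \{-n\epsilon ^{2}/c_{1}\}\) tail: for a fixed failure probability, the achievable \(\epsilon \) scales as \(n^{-1/2}\).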
Finally, let us turn to \(A_{T,k,3}\). Recall \(P(A_{T,k,3}):=P(A_{T,k}\cap C_{T}\cap E_{3}):=P(A_{T,k}\cap C_{T}\cap \{\Vert \sum _{t=\hat{\tau }_{k}}^{\tau _{k}-1}W^{(t)}\Vert _{F}\ge \Vert R_{1}\Vert _{F}/3\})\). Given that \(\Vert R_{1}\Vert _{F}\ge (\tau _{k}-\hat{\tau }_{k})\eta _{\min }\) with probability one, an upper bound on \(P[A_{T,k,3}]\) can be found using the same concentration bounds (Lemma 2) as for \(A_{T,k,2}\). The only difference is that we need to replace the summation interval n with \(T\delta _{T}\). Noting that \(T\delta _{T}<\tau _{k}-\hat{\tau }_{k}\le d_{\min }/2\), the overall bound will be dominated by the concentration results requiring \(n>T\delta _{T}\).
1.3 Bounding the bad cases
In order to complete the proof, we need to demonstrate that \(P[A_{T,k}\cap C_{T}^{c}]\rightarrow 0\). Again, the argument below follows that of Harchaoui and Lévy-Leduc (2010), whereby the bad case is split into several events:
where \(C_{T}^{c}=\{\max _{k\in [K]}|\hat{\tau }_{k}-\tau _{k}|\ge d_{\min }/2\}\) is the complement of the good event. The events above correspond to estimating a changepoint: (a) before the previous true changepoint (\(D_{T}^{(l)}\)); (b) between the previous and next true changepoints (\(D_{T}^{(m)}\)); and (c) after the next true changepoint (\(D_{T}^{(r)}\)). The events \(D_{T}^{(l)}\) and \(D_{T}^{(r)}\) appear to be particularly bad, as the estimated changepoint is very far from the truth; due to symmetry, we can bound these two events in a similar manner. Focussing on the middle term \(P[A_{T,k}\cap D_{T}^{(m)}]\), let us again assume \(\hat{\tau }_{k}<\tau _{k}\); the reverse argument holds by symmetry.
The probability of the intersection of \(A_{T,k}\) and \(D_{T}^{(m)}\) can be bounded from above by considering the events
In particular, one can demonstrate that:
Let us first assess \(P[A_{T,k}\cap D_{T}^{(m)}\cap E_{k}^{'}]\), and consider the stationarity conditions (10) with start and end points set as \(l=\hat{\tau }_k,l=\tau _{k}\) and \(l=\hat{\tau }_{k},l=\tau _{k+1}\). We obtain, respectively:
and
Using the triangle inequality, we bound \(\Vert \Sigma _0^{(k+1)}-\Sigma _0^{(k)}\Vert _{F}\) conditional on \(E_{k}^{'}:=\{(\hat{\tau }_{k+1}-\tau _{k})\ge d_{\min }/2\}\) and \(A_{T,k}:=\{|\tau _{k}-\hat{\tau }_{k}|>T\delta _{T}\}\). Specifically, we construct the event
which bounds the first term of (19) such that \(P[A_{T,k}\cap E_{k}^{'}\cap D_{T}^{(m)}]\le P[H_{T}^{\Sigma }\cap \{\tau _{k}-\hat{\tau }_{k}\ge T\delta _{T}\}\cap E_{k}^{'}]\). Splitting the intersection of events we now have five terms to consider
The stochastic error terms (containing \(V^{(k)}_{\tau _{k}-\hat{\tau }_{k}}\)) can then be shown to converge similarly to \(P(A_{T,k}\cap C_{T})\). Again, it is worth noting that the term involving \(T\delta _{T}\) will be the slowest to converge, as \(d_{\min }=\gamma _{\min }T>\delta _{T}T\) for large T. The first three terms are bounded through the assumptions on \(d_{\min },\lambda _{1},\lambda _{2}\), and \(\delta _{T}\) required by the theorem (and enforce a similar requirement to those used to bound \(P(A_{T,k,1})\) in Eq. 12). The other terms in (19), i.e. \(\sum _{j=k+1}^{K}P[E_{j}^{''}\cap E_{j}^{'}\cap D_{T}^{(m)}]\), can be similarly bounded; instead of using exactly the event \(H_{T}^{\Sigma }\), one simply replaces the term \(1/T\delta _{T}\) in (22) with \(2/d_{\min }\).
Now let us consider the events \(D_{T}^{(l)}:=\left\{ \exists k\in [K],\;\hat{\tau }_{k}\le \tau _{k-1}\right\} \cap C_{T}^{c}\). The final step of the proof is to show that the bounds on \(A_{T,k}\cap D_{T}^{(l)}\), and similarly \(A_{T,k}\cap D_{T}^{(r)}\), tend to zero. To achieve this, we introduce an upper bound derived from the combinatorics of the estimated changepoints:
Claim
The probability of \(D_{T}^{(l)}\) is bounded by
We omit the proof for brevity; however, we remark that a similar argument is made in Harchaoui and Lévy-Leduc (2010, Eq. 31). In order to bound the above probabilities, we relate the events \(E_{l}^{''}\) and \(E_{l}^{'}\) to the optimality conditions as before, via Eq. 10. Setting \(k=l\) and invoking the triangle inequality gives us (similarly to Eq. 22) the event
where \(M=2\lambda _{2}(|\tau _{l}-\hat{\tau }_{l}|^{-1}+|\hat{\tau }_{l+1}-\tau _{l}|^{-1})\). Conditioning on the event \(E_{l}^{''}\cap E_{l}^{'}\) implies that \(M=8\lambda _{2}/d_{\min }\). We can thus write
Finally, the term corresponding to the last changepoint can be bounded by noting that when \(k=K\) we have \(M=6\lambda _{2}/d_{\min }\), and
1.4 Summary
The bounds derived in Sects. 1.1–1.3 demonstrate that \(P(A_{T,k})\rightarrow 0\), since \(P(A_{T,k}\cap C_{T})\rightarrow 0\) and \(P(A_{T,k}\cap C_{T}^{c})\rightarrow 0\). However, to achieve these bounds, the regularisers must be set appropriately. The event \(E_{l}^{''}\cap E_{l}^{'}\) establishes a minimal condition on T in conjunction with \(\eta _{\min }\) and the regularisers, such that \(\eta _{\min }d_{\min }/\lambda _{2}>32\) and \(\eta _{\min }/\lambda _{1}\sqrt{p(p-1)}>8\). A final condition for \(A_{T,k,1}\) requires \(\eta _{\min }T\delta _{T}/\lambda _{2}>3\). Once T is large enough to satisfy these conditions, the probabilistic bound is determined either by the smallest block size \(d_{\min }=\gamma _{\min }T\) or by the minimum error \(T\delta _{T}\). Let \(k_{\infty }=\arg \max _{k}\{\max _{i}\Sigma _{0;ii}^{(k)}\}\) select the block which results in the largest expected covariance error. Summing the probabilities, one obtains the upper bound:
where the top row corresponds to \(D_{T}^{(l)}\) and \(D_{T}^{(r)}\); the middle to \(D_{T}^{(m)}\); and the bottom to \(A_{T,k,2}\) and \(A_{T,k,3}\). Since \(\delta _{T}T<\gamma _{\min }T\) for large T, the above bounds will be dominated by the errors \(V^{(k_{\infty })}_{T\delta _{T}}\) integrated over the relatively small distance \(T\delta _{T}\). A suitable overall bound on the probability is
In the Gaussian case where \(\sigma _{t}=1\) for all t, we have
where \(C_{K}=K(K^{2}2^{K+1}+4)\) and \(c_{3}=5^{4}2^{7}\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\), for all \(\eta _{\min }\le 2^{3}5^{2}p\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\). We thus arrive at the result of Theorem 1.
Appendix 2: Proof of model-selection consistency
Assumption 3
The event \(E_{\tau }:=\{\max _{k}|\hat{\tau }_{k}-\tau _{k}|\le T\delta _{T}\}\) holds with some increasing probability \(1-f_{\tau }(T)\rightarrow 1\) as \(T\rightarrow \infty \).
For the model-selection consistency argument, we extend that presented in the stationary i.i.d. setting, as discussed in Ravikumar et al. (2011). Specifically, this follows the primal-dual witness method as described in Wainwright (2009). Since \(\{\hat{\Theta }^{(k)}\}_{k=1}^{\hat{B}}\) is an optimal solution for GFGL, for each estimated block \(k=1,\ldots ,\hat{B}=K+1\) it needs to satisfy
where \(\hat{n}_{lk}\) describes the proportion of overlap between the lth true block and the kth estimated block. The term \(\sum _{l\ne k\in [\hat{B}]}\hat{n}_{lk}(V^{(l)}_{\hat{n}_{lk}})\) can be thought of as providing a sampling bias due to estimation error in the changepoints, whereas the term \(\hat{n}_{kk}V^{(k)}_{\hat{n}_{kk}}\) compares samples with the ground truth of the same underlying covariance matrix.
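Schematically, and hedging on the exact signs and normalisation used in Eq. (24), the block-wise stationarity condition couples \((\hat{\Theta }^{(k)})^{-1}\) to a weighted mixture of true covariances and sampling errors of the form:

```latex
\sum_{l\in[\hat{B}]}\frac{\hat{n}_{lk}}{\hat{n}_{k}}
\Big(\Sigma_{0}^{(l)}+V^{(l)}_{\hat{n}_{lk}}\Big)
-\big(\hat{\Theta}^{(k)}\big)^{-1}
+\lambda_{1}\hat{R}_{1}^{(k)}
+\frac{\lambda_{2}}{\hat{n}_{k}}
\Big(\hat{R}_{2}^{(\hat{\tau}_{k-1})}+\hat{R}_{2}^{(\hat{\tau}_{k})}\Big)
\;=\;0 .
```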
We will now proceed to construct an oracle estimator \(\bar{\Theta }\in {\mathbb {M}}_{\hat{B}}:=\{U^{(k)}\in {\mathbb {R}}^{p\times p}|\,U_{{\mathcal {M}}^{\perp }}^{(k)}=0,U^{(k)}\succ 0\}_{k=1}^{\hat{B}}\). The oracle is constructed through solving the restricted problem
The construction above does not utilise oracle knowledge to enforce changepoint positions, only the sparsity structure of the block-wise precision matrices. Again, for each estimated block, we obtain a set of optimality conditions like (24). Let us denote the sub-gradient of the restricted problem evaluated at the oracle solution as \(\bar{R}_{1}^{(k)}\equiv \bar{R}_{1}^{(\hat{\tau }_{k-1})}\) for the \(\ell _{1}\) penalty, and \(\bar{R}_{2}^{(\hat{\tau }_{k-1})},\bar{R}_{2}^{(\hat{\tau }_{k})}\) for the smoothing components. By definition, the matrices \(\bar{R}_{2}^{(\hat{\tau }_{k-1})},\bar{R}_{2}^{(\hat{\tau }_{k})}\) are members of the sub-differential and hence dual feasible. To show that \(\bar{\Theta }\) is also a minimiser of the unrestricted GFGL problem (2), we will show that \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }\le 1\), and that it is hence dual feasible.
Ravikumar et al. (2011, Lemma 4) demonstrate that for the standard graphical lasso problem strict dual feasibility can be obtained by bounding the maxima of both the sampling and estimation error. The estimation error (on the precision matrices) is tracked through the difference (remainder) between the gradient of the log-det loss function and its first-order Taylor expansion. In our case, we will track the precision matrices at each block k via the remainder function defined as
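Following the log-determinant analysis of Ravikumar et al. (2011), this remainder takes the standard form

```latex
{\mathcal{E}}(\Delta)
\;=\;(\Theta_{0}+\Delta)^{-1}-\Theta_{0}^{-1}+\Theta_{0}^{-1}\,\Delta\,\Theta_{0}^{-1},
```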
where \(\Delta =\bar{\Theta }-\Theta _{0}\in {\mathbb {R}}^{p\times p}\).
Lemma 3
(Dual Feasibility) The out-of-subspace parameters are dual feasible such that \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }<1\) if
where
We note at this point that, in the setting where \(T\rightarrow \infty \), the condition (26) converges to that of the standard graphical lasso (Ravikumar et al. 2011). Specifically, if the changepoint error is bounded according to the event \(E_{\tau }:=\{\max _{k}|\hat{\tau }_{k}-\tau _{k}|\le T\delta _{T}\}\), the mis-specification error averaged across the block converges to the exact case \(\tilde{V}^{(k)}\rightarrow V_{\mathrm {exact}}^{(k)}\), where exact refers to the setting with zero changepoint-estimation error.
Lemma 4
The average sampling error over estimated block k is bounded for some \(\epsilon <c_{5}:=2^{3}5\Vert \Sigma ^{(k_{\infty })}_0\Vert _{\infty }\) according to
for \(c_{4}=2^{7}5^{2}\Vert \Sigma ^{(k_{\infty })}_0\Vert _{\infty }\) .
Applying this result to (26) and choosing the regulariser \(\lambda _{1}=16\alpha ^{-1}\epsilon \) enables the condition \(\Vert \tilde{V}^{(k)}\Vert _{\infty }\le \alpha \lambda _{1}/16\) in (26) to be satisfied with high probability. Specifically, we have
where \(f_{V}(T)=4p^{2}\exp \{-c_{4}^{-1}\epsilon ^{2}(\gamma _{\min }-2\delta _{T})T\}\), and note that \(f_{V}(T)\rightarrow 0\) as \(T\rightarrow \infty \).
We now turn our attention to the size of the remainder \(\Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\). In the first step, we directly invoke a result from Ravikumar et al. (2011):
Lemma 5
If the bound \(\Vert \Delta \Vert _{\infty }\le (3K_{\Sigma _{0}}d)^{-1}\) holds and d is the maximum node degree, then
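In our notation, the corresponding bound (cf. Ravikumar et al. 2011, Lemma 5) reads

```latex
\Vert {\mathcal{E}}(\Delta)\Vert_{\infty}
\;\le\; \tfrac{3}{2}\,d\,K_{\Sigma_{0}}^{3}\,\Vert \Delta\Vert_{\infty}^{2}.
```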
While we can use the same relation as Ravikumar et al. (2011) to map \(\Vert \Delta \Vert _{\infty }\) to \(\Vert {\mathcal {E}}(\Delta )\Vert _{\infty }\), we need to modify our argument for the actual control on \(\Vert \Delta \Vert _{\infty }\).
Lemma 6
The elementwise \(\ell _{\infty }\) norm of the error is bounded such that \(\Vert \bar{\Delta }\Vert _{\infty }=\Vert \bar{\Theta }-\Theta _{0}\Vert _{\infty }\le r\) if
and \(r\le \min \{(3K_{\Sigma _{0}}d)^{-1},(3K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}d)^{-1}\}\).
Note that the contribution of the fused sub-gradient is bounded as \(\lambda _{2}\hat{n}_{k}^{-1}(\Vert \bar{R}_{2}^{(\hat{\tau }_{k-1})}\Vert _{\infty }+\Vert \bar{R}_{2}^{(\hat{\tau }_{k})}\Vert _{\infty })\le 2\lambda _{2}\hat{n}_{k}^{-1}\). Assuming further that \(\lambda _{2}=\lambda _{1}\rho \) for some \(\rho >0\), we can upper bound (28) with the stated form of \(\lambda _{1}\) such that
where
We have two constraints on \(\epsilon \): one from the concentration bound (Lemma 4), whereby \(\epsilon \le 2^{3}5\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\); a second from Lemma 6 gives
which implies \(\epsilon \le 1/v_{E_{V}}\) where \(v_{E_{V}}:=6dz_{\hat{n}_{k}}\max \{K_{\Sigma _{0}}K_{\Gamma _{0}},K_{\Sigma _{0}}^{3}K_{\Gamma _{0}}^{2}\}\). Now let us use Lemma 5 to obtain
where the last line comes from setting \(\epsilon =\lambda _{1}\alpha /16\). To demonstrate dual feasibility, we therefore have to satisfy the further constraint that \(\epsilon \le 1/v_{{\mathcal {E}}}\) where \(v_{{\mathcal {E}}}:=6dK_{\Gamma _{0}}^{2}K_{\Sigma _{0}}^{3}z_{\hat{n}_{k}}^{2}\). Dual feasibility is obtained in the case where
Consider the lower bound \(\hat{n}_{k}>(\gamma _{\min }-2\delta _{T})T\rightarrow \gamma _{\min }T\) as \(T\rightarrow \infty \); the term \(z_{\hat{n}_{k}}\rightarrow 1+16\alpha ^{-1}\). For convergence of the tail bound (Lemma 4) we need
Thus, we can choose a rate \(\epsilon =\Omega (T^{-1/2})\) and still maintain dual feasibility.
The final step of the proof is to demonstrate that all possible solutions to GFGL maintain this relation. In the case of GFGL, the following lemma states that the objective function is strictly convex on the positive-definite cone. Hence, if we find a minimum it is the global minimum, and the dual-feasibility condition ensures that the suggested bounds are achieved.
Lemma 7
(Strict Convexity) For matrices \(\Theta _{T}\in {\mathcal {S}}_{++}^{T}:=\{\{U^{(t)}\}_{t=1}^{T}\,|\,U^{(t)}\succ 0\,,\,U^{(t)}=U^{(t)\top }\}\) the GFGL cost function is strictly convex.
Appendix 3: Proof of Lemmata
3.1 Bounds for empirical covariance error
Proof
(Proof of Lemma 2) We want to bound \(P[\Vert V^{(k)}_n\Vert _{\infty }\ge \epsilon ]\). Our approach is to extend the single-block tail bound of Ravikumar et al. (2011), derived in an i.i.d. setting, to the case where samples are independent, but not necessarily identically distributed. Without loss of generality, consider the event
Recall that the deviation of a sub-exponential random variable Z with \(E[Z]=\mu _{Z}\) is bounded by the inequality
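With sub-exponential parameters \((\gamma ^{2},b)\), the standard two-regime form of this inequality is

```latex
P[\,|Z-\mu_{Z}|\ge v\,]\;\le\;
\begin{cases}
2\exp\{-v^{2}/(2\gamma^{2})\}, & 0\le v\le \gamma^{2}/b,\\[2pt]
2\exp\{-v/(2b)\}, & v>\gamma^{2}/b.
\end{cases}
```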
For independent sub-exponential \(Z^{(t)}\) with parameters \((\gamma _{t}^{2},b_{t})\), the sum \(Z_{ij}=Z_{ij}^{(1)}+\ldots +Z_{ij}^{(n)}\) is sub-exponential with parameters \((\sum _{t}\gamma _{t}^{2},\max _{t}b_{t})\). Now consider the event \({\mathbb {A}}_{ij}(v)\), and construct the auxiliary variables \(A_{ij}^{(t)}:=\check{X}_{i}^{(t)}+\check{X}_{j}^{(t)}\) and \(B_{ij}^{(t)}:=\check{X}_{i}^{(t)}-\check{X}_{j}^{(t)}\), where \(\check{X}_{i}^{(t)}=X_{i}^{(t)}/\sqrt{\Sigma _{0;ii}^{(t)}}\). We note that Lemma 9 of Ravikumar et al. (2011) holds at the individual time step t such that, if \(\check{X}_{i}^{(t)}\) is sub-Gaussian with parameter \(\sigma _{t}\), then the random variables \(A_{ij}^{(t)}\) and \(B_{ij}^{(t)}\) are sub-Gaussian with parameter \(2\sigma _{t}\), and for all \(d>0\)
Through Lemma 10 (Ravikumar et al. 2011) we have \(Z_{ij}^{(t)}:=(A_{ij}^{(t)})^{2}-2(1+\rho _{ij}^{(t)})\) is sub-exponential with parameter \(\gamma =b=16(1+4\sigma ^{2})\). Application of (30) with the tighter bound (\(0\le v\le \gamma ^{2}/b\)), gives us
where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all \(d\le nc_{2}\), where \(c_{2}=2^{3}\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}(1+4\min _{t}\{\sigma _{t}^{2}\})^{2}/(1+4\max _{t}\{\sigma _{t}^{2}\})\). A bound on the required quantity \(P(\Vert V^{(k)}_n\Vert _{\infty }\ge \epsilon )\) follows from application of the union bound over both A and B, and then further over the individual elements of the \(p\times p\) matrices. In a general setting, we obtain
where \(c_{1}=\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}2^{7}(1+4\max _{t}\{\sigma _{t}^{2}\})^{2}\) for all \(d\le nc_{2}\).
The required bound is given by the inequality \(\Vert X\Vert _{F}\le p\Vert X\Vert _{\infty }\) and setting \(d=\epsilon n\) with \(\epsilon \le c_{2}\). \(\square \)
Proof
(Proof of Lemma 4) For the quantity \(\Vert \tilde{V}^{(k)}\Vert _{\infty }\) we consider adapting the proof of Lemma 2. A bound can be derived from (31) by setting \(n=\hat{n}_{k}\) and \(d=\epsilon \hat{n}_{k}\), letting \(\max _{i,t}\{\Sigma _{0;ii}^{(t)}\}=\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\) and \(\max _{t}\{\sigma _{t}^{2}\}=\min _{t}\{\sigma _{t}^{2}\}=1\), i.e. all variates are Gaussian. We then obtain
for \(c_{4}=2^{7}5^{2}\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\) and \(\epsilon <c_{5}:=2^{3}5\Vert \Sigma _0^{(k_{\infty })}\Vert _{\infty }\). \(\square \)
3.2 Dual-feasibility with mis-specification (Proof of Lemma 3)
Proof
We can write the block-wise optimality conditions (24) for the restricted estimator as
As pointed out in Ravikumar et al. (2011), this equation may be written as an ordinary linear equation by vectorising the matrices, for instance \({\mathrm {vec}}\{(\Theta _{0}^{(k)})^{-1}\Delta ^{(k)}(\Theta _{0}^{(k)})^{-1}\}=\{(\Theta _{0}^{(k)})^{-1}\otimes (\Theta _{0}^{(k)})^{-1}\}{\mathrm {vec}}(\Delta ^{(k)})\equiv \Gamma _{0}{\mathrm {vec}}(\Delta )\). Utilising the fact \(\Delta _{{\mathcal {M}}^{\perp }}=0\) we can split the optimality conditions into two blocks of linear equations
where
Solving (32) for \({\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})\) we find \({\mathrm {vec}}(\Delta _{{\mathcal {M}}}^{(k)})=-(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)})^{-1}{\mathrm {vec}}\{G^{(k)}_{\hat{n}_k}(X;\lambda _{1},\lambda _{2})_{{\mathcal {M}}}\}\). Substituting this into (33) and re-arranging for \(\bar{R}_{1;{\mathcal {M}}^{\perp }}\) gives
and thus, letting \(H^{(k)}:=\Gamma _{0;{\mathcal {M}}^{\perp }{\mathcal {M}}}^{(k)}(\Gamma _{0;{\mathcal {M}}{\mathcal {M}}}^{(k)})^{-1}\) we obtain
Taking the \(\ell _{\infty }\) norm of both sides gives
\(\square \)
Claim
The error in the model-space dominates that outside such that
Furthermore, the maximum size of the sub-gradient in the model subspace is bounded \(\Vert \bar{R}_{1;{\mathcal {M}}}^{(k)}\Vert _{\infty }\le 1\).
A similar claim is made in Ravikumar et al. (2011), and thus, via the results above, we obtain \(\Vert H^{(k)}{\mathrm {vec}}(\bar{R}_{1;{\mathcal {M}}}^{(k)})\Vert _{\infty }\le 1-\alpha \) and
The condition (26) stated in the lemma now ensures \(\Vert \bar{R}_{1;{\mathcal {M}}^{\perp }}^{(k)}\Vert _{\infty }<1\).
3.3 Control of estimation error (Proof of Lemma 6)
Note that \(\bar{\Theta }^{(k)}_{{\mathcal {M}}^{\perp }}=\Theta ^{(k)}_{0;{\mathcal {M}}^{\perp }}=0\) and thus \(\Vert \Delta ^{(k)}\Vert _{\infty }=\Vert \Delta ^{(k)}_{{\mathcal {M}}}\Vert _{\infty }\). The spirit of our proof follows Lemma 6 of Ravikumar et al. (2011). The first step is to characterise the solution \(\bar{\Theta }_{{\mathcal {M}}}\) in terms of the zero-gradient condition of the restricted oracle problem. Define a function to represent the block-wise optimality conditions [akin to Eq. (75) of Ravikumar et al. (2011)]
Now construct a map \(F:\Delta ^{(k)}_{{\mathcal {M}}}\mapsto F(\Delta ^{(k)}_{{\mathcal {M}}})\) such that its fixed points are equivalent to the zeros of the gradient expression in terms of \(\Delta ^{(k)}_{{\mathcal {M}}}\). To simplify the analysis, let us work with the vectorised form and define the map
such that \(F\{{\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})\}={\mathrm {vec}}(\Delta ^{(k)}_{{\mathcal {M}}})\) iff \(Q(\Theta _{0;{\mathcal {M}}}^{(k)}+\Delta ^{(k)}_{{\mathcal {M}}})=Q(\Theta _{{\mathcal {M}}}^{(k)})=0\). Now, to ensure that all solutions satisfying the zero-gradient expression have their error bounded within the ball, we demonstrate that F maps the \(\ell _{\infty }\) ball \({\mathbb {B}}(r):=\{\Theta ^{(k)}_{\mathcal {M}}|\,\Vert \Theta ^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\}\) onto itself. Expanding \(F({\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}))\), we find
where
The rest of the proof follows from Ravikumar et al. (2011), where via Lemma 5, one can show
Under the assumptions of the lemma, we obtain \(\Vert T_{1}\Vert _{\infty }\le r/2\). Combined with the stated form of r, we also find \(\Vert T_{2}\Vert _{\infty }\le r/2\), and thus \(\Vert F({\mathrm {vec}}(\Delta ^{(k)}_{\mathcal {M}}))\Vert _{\infty }\le r\). Through the construction of F, we have \(\Vert \Delta ^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\) iff \(Q(\Theta _{0;{\mathcal {M}}}^{(k)}+\Delta ^{(k)}_{\mathcal {M}})=Q(\Theta _{\mathcal {M}}^{(k)})=0\), and since \(Q(\bar{\Theta }_{\mathcal {M}}^{(k)})=0\) for any \(\bar{\Theta }_{\mathcal {M}}^{(k)}\), we obtain \(\Vert \bar{\Delta }^{(k)}_{\mathcal {M}}\Vert _{\infty }\le r\), where \(\bar{\Delta }^{(k)}_{\mathcal {M}}:=\bar{\Theta }^{(k)}-\Theta _{0}^{(k)}\). Finally, the existence of a solution \(\bar{\Theta }_{\mathcal {M}}^{(k)}\) corresponding to \({\mathrm {vec}}(\bar{\Delta }^{(k)}_{\mathcal {M}})\in {\mathbb {B}}(r)\) is guaranteed by Brouwer's fixed-point theorem. \(\square \)
Appendix 4: Further results
4.1 Strict convexity of GFGL (Proof of Lemma 7)
Proof
The negative log-det barrier \(-\log \det (\Theta ^{(t)})\) is strictly convex on \(\Theta ^{(t)}\in {\mathcal {S}}_{++}^{(1)}\). While the Frobenius norm \(\Vert A\Vert _{F}\) is strictly convex for a single matrix \(A\in {\mathcal {S}}_{++}^{(1)}\), the mixed norm \(\sum _{t=2}^{T}\Vert \Theta ^{(t)}-\Theta ^{(t-1)}\Vert _{F}\) is not strictly convex over \(\{\Theta ^{(t)}\}\in {\mathcal {S}}_{++}^{(T)}\). However, by Lagrangian duality, we can re-write the GFGL problem as an explicitly constrained problem
We can alternatively write
A similar argument to that used in Ravikumar et al. (2011) now holds. Specifically, we note that even the rank-one estimate \(\hat{S}^{(t)}\) will have positive diagonal entries \(\hat{S}_{ii}^{(t)}>0\) for all \(i=1,\ldots ,p\). The off-diagonal entries in the precision matrix are restricted through the \(\ell _{1}\) term. Unlike in the standard static case, the size of this norm is not related to a single precision matrix; rather, it counts the size of the off-diagonals over the whole set \(\{\Theta ^{(t)}\}_{t=1}^{T}\). Thus, to obtain strict convexity, one also needs to include appropriate smoothing. To borrow the argument used in Ravikumar et al. (2011), we need to demonstrate that for any time-point t we can construct a problem of the form \(\min _{\Theta ^{(t)}\in {\mathcal {S}}_{++}^{(1)}}\big \{\langle \Theta ^{(t)},\hat{S}^{(t)}\rangle -\log \det (\Theta ^{(t)})\big \}\) such that \(\Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{t}(\lambda _{1},\lambda _{2})\). The constraint due to smoothing allows exactly this; for instance, one may obtain a bound \(\Vert \Theta _{-ii}^{(t)}\Vert _{1}\le C_{\mathrm {sparse}}(\lambda _{1},\lambda _{2})-\sum _{s\ne t}\Vert \Theta _{-ii}^{(s)}\Vert _{1}\). Writing \(\Theta ^{(s)}=\Theta ^{(1)}+\sum _{q=2}^{s}(\Theta ^{(q)}-\Theta ^{(q-1)})\) for \(s\ge 2\), we obtain
where we note \(\Vert \cdot \Vert _{1}\le p\Vert \cdot \Vert _{F}\) for \(p\times p\) matrices. Converting to the bound for the \(\ell _{1}\) norm at time t, we find
and thus an effective bound on the time-specific \(\ell _{1}\) norm can be obtained. \(\square \)
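Two of the ingredients of this proof can be checked numerically: the strict convexity of the log-det barrier on positive definite matrices, and the norm inequality \(\Vert \cdot \Vert _{1}\le p\Vert \cdot \Vert _{F}\) used in the telescoping bound. A minimal sketch (the random matrices and dimension are arbitrary illustrations, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

def random_spd(p):
    # Symmetric positive definite matrix with a diagonal shift.
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

def neg_log_det(X):
    # slogdet returns (sign, log|det|); X is PD so sign is +1.
    return -np.linalg.slogdet(X)[1]

# Strict convexity of -log det: for distinct PD matrices A != B, the value
# at the midpoint lies strictly below the average of the endpoint values.
A, B = random_spd(p), random_spd(p)
mid = neg_log_det(0.5 * (A + B))
avg = 0.5 * (neg_log_det(A) + neg_log_det(B))
assert mid < avg

# Norm inequality used above: the entrywise l1 norm of a p x p matrix is
# at most p times its Frobenius norm (Cauchy-Schwarz over p^2 entries).
M = rng.standard_normal((p, p))
assert np.abs(M).sum() <= p * np.linalg.norm(M, 'fro')
```

The strict midpoint inequality is exactly what fails for the mixed Frobenius norm over the sequence \(\{\Theta ^{(t)}\}\), which is why the smoothing constraint is needed to recover strict convexity of the full objective.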
4.2 Consistency when \(\hat{K}\ge K\) (Proof Sketch for Proposition 1)
Proof
The following is a sketch of the proof of Proposition 1. We begin by recalling the assumption \(K\le \hat{K}\le K_{\max }\). The bound can be split into two parts: one corresponding to the correctly identified number of changepoints, and one for those which exceed this number:
Focusing on the second bound, Harchaoui and Lévy-Leduc (2010) demonstrate that it can be broken down in terms of three events:
where
The probability of each of these events can be bounded in a similar way to Eq. 14, i.e. by considering specific start and end points in the optimality conditions given by Lemma 1. In the interests of space, we refer the reader to Harchaoui and Lévy-Leduc (2010) for the technical details of this procedure.
Finally, in the statement of the above proposition, we asserted that some \(K_{\max }\) upper bounds the number of estimated changepoints. By considering the bias of the GFGL estimator as a function of specific \(\lambda _1,\lambda _2\), which scale as discussed in the main paper, one should be able to demonstrate that the estimated number of changepoints is indeed bounded from above. In the seminal paper of Harchaoui and Lévy-Leduc (2010), this is achieved via Lemma 2 of Meinshausen and Yu (2009). In the case of GFGL, with the log-determinant constrained likelihood and the additional regulariser, a new result would be needed to upper bound \(\hat{K}\). One possible pathway is via the analysis provided in Rothman et al. (2008), extended to the non-stationary setting. Proof of such a result is left as future work. \(\square \)
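The error metric underlying this kind of consistency statement can be made concrete. When \(\hat{K}\ge K\), accuracy is naturally measured by the one-sided distance \(\max _k \min _j |\hat{\tau }_j-\tau _k|\): every true changepoint must be well approximated by some estimated changepoint, while surplus estimates are tolerated. A minimal sketch, with hypothetical changepoint locations chosen purely for illustration:

```python
import numpy as np

# One-sided changepoint matching error: for each true changepoint, take the
# distance to the nearest estimate, then report the worst case.  Extra
# (surplus) estimated changepoints do not inflate this metric.

def one_sided_distance(true_cps, est_cps):
    est_cps = np.asarray(est_cps)
    return max(np.min(np.abs(est_cps - t)) for t in true_cps)

true_cps = [100, 250]          # K = 2 true changepoints (hypothetical)
est_cps = [98, 150, 252]       # K-hat = 3 >= K estimates (hypothetical)

print(one_sided_distance(true_cps, est_cps))   # prints 2
```

Here the spurious estimate at 150 is ignored by the metric; bounding the probability that surplus estimates fall far from every true changepoint is precisely the role of the three events discussed above.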
Gibberd, A., Roy, S. Consistent multiple changepoint estimation with fused Gaussian graphical models. Ann Inst Stat Math 73, 283–309 (2021). https://doi.org/10.1007/s10463-020-00749-0