
Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models

Abstract

We study the nonasymptotic properties of a general norm penalized estimator, which includes the Lasso, weighted Lasso, and group Lasso as special cases, for sparse high-dimensional misspecified Cox models with time-dependent covariates. Under suitable conditions on the true regression coefficients and the random covariates, we provide oracle inequalities for the prediction and estimation error based on the group sparsity of the true coefficient vector. The nonasymptotic oracle inequalities show that the penalized estimator gives a good sparse approximation of the true model and enables the selection of a few meaningful structural variables among the set of features.

1 Introduction

In recent years, high-throughput and nonparametric complex data have been frequently collected in gene biology, signal processing, neuroscience, and other scientific fields. With massive data in regression problems, we encounter situations where both the number of covariates p and the sample size n increase, and p is a function of n, i.e., \(p=:p(n)\). The curse of dimensionality and the associated computational complexity force us to perform variable selection, since the true regression coefficient \(\beta^{*}\) is often sparse with few nonzero components. Thus only a subset of the variables is preferable as important features, and the sparse set of nonzero coordinates in \(\beta^{*}\) also serves to choose the best model. A popular approach is to penalize the log-likelihood by adding a penalty function, which intuitively leads to choosing a sparse model. One widely used method is the Lasso (least absolute shrinkage and selection operator), introduced by Tibshirani [23] as a modification of the least squares method in linear models. With the development of data science, high-dimensional statistics, including various regularization methods (such as the group Lasso and the weighted Lasso), has grown rapidly through statisticians’ efforts over the past two decades.

Ever since the introduction of the Lasso for linear models, the study of various penalty functions (from data-independent to data-driven penalties) and loss functions (from smooth to non-smooth, from Lipschitz to non-Lipschitz) has remained an active topic in high-dimensional statistics, even though Lasso regularization itself has been thoroughly analyzed. In many practical applications, however, predictors may have group structures. Yuan and Lin [29] study the problem of selecting grouped variables for accurate prediction in linear regression, and their proposed group Lasso extends the Lasso to improve estimation accuracy. When considering variable selection in Cox models, massive data sets bring researchers unprecedented computational challenges, see Tibshirani [24]. Fan and Li [9] study the SCAD penalized partial likelihood approach for Cox models, and the proposed estimator enjoys the oracle property if a proper regularization parameter is chosen. Zhang and Lu [34] consider different penalties for different coefficients (the adaptive Lasso), and their idea is that “unimportant variables receive larger penalties than important ones so that important variables tend to be retained in the selection process, whereas unimportant variables are more likely to be dropped”. Theoretical properties of this adaptive Lasso estimator, including consistency and the rate of convergence, are also established by Zhang and Lu [34] when the number of covariates is fixed.

A potential characteristic of large-scale gene data associated with survival time is that only a few (maybe several) significant predictors exist among p (maybe thousands of) covariates, and \(p \gg n\) apparently. For example, the survival of patients with diffuse large-B-cell lymphoma (DLBCL) after chemotherapy is affected by molecular features of the tumors, which are measured by high-dimensional microarray gene expression. Rosenwald et al. [20] adopt Cox models to identify individual genes whose expression is correlated with the outcome, and the data contain \(n = 240\) patients and \(p = 7399\) gene expression levels associated with a good or an adverse outcome. The main challenge is that directly utilizing low-dimensional (classical and traditional) statistical inference and computing methods for these data is prohibitive. Fortunately, the regularized partial likelihood method can perform parameter estimation and variable selection to enhance the prediction accuracy and interpretability of the Cox models.

It is a fact that the Lasso estimator is not asymptotically normal; the exact limiting distribution of the Lasso estimator is hard to derive and has no explicit form, see Knight and Fu [15]. To avoid this difficulty, a popular approach is to derive nonasymptotic oracle inequalities under some regularity conditions. As early as 2004, oracle inequalities for the prediction error were derived without sparsity or restricted eigenvalue conditions for Lasso-type estimators [see Greenshtein and Ritov [10], Bartlett et al. [3]].

In the classical consistency analysis, the model size p is fixed and the sample size n goes to infinity. In contrast, high-dimensional statistical consistency analysis requires nonasymptotic error bounds when both the model size p and the sample size n go to infinity.

Let \({\beta ^{*}}\) be the true regression coefficient underlying the regression data \(\{{ X}_{i}, Y_{i}\} _{i = 1}^{n}\), where \({X}_{i}\) is a p-dimensional covariate vector and \(Y_{i} \in \mathbb{R}\) is the response. A modern problem, which will be the focus of this paper, is the behavior of β̂ when its dimension grows with the number of samples. There are two types of statistical guarantees of a penalized estimate that are of interest in this setting (as mentioned by Bartlett et al. [3]):

  1.

    Prediction error (Persistence): β̂ performs well on future samples

    $$ \bigl(\text{i.e., } {\mathrm{{E}}} {\bigl[{X}\bigl( \hat{\beta }- {\beta ^{*}} \bigr)\bigr]^{2}}\text{ (or its empirical version) is small, called persistence}\bigr). $$
  2.

    \(\ell _{1}\)-estimation error: β̂ approximates some “true” parameter \(\beta ^{*}\)

    $$ \bigl(\text{i.e., } \bigl\Vert \hat{\beta }- {\beta ^{*}} \bigr\Vert _{1}\text{ is small with high probability}\bigr). $$

The two types of statistical guarantees can be obtained from the following error bounds (so-called oracle inequalities)

$$ \bigl\Vert \hat{\beta }- {\beta ^{*}} \bigr\Vert _{1} \le {O_{p}}(s{\lambda _{n}}), \qquad {\mathrm{{E}}} {\bigl[{X} \bigl( \hat{\beta }- {\beta ^{*}} \bigr)\bigr]^{2}} \le {O_{p}}\bigl(s{\lambda _{n}^{2}}\bigr), $$

where \({\lambda _{n}}\to 0\) is a tuning parameter and \(s:=\|\beta ^{*}\|_{0}\).

Deriving oracle inequalities is a powerful mathematical technique that provides deep insight into the nonasymptotic fluctuation of an estimator around the ideal unknown parameter (called an oracle). Under linear models with group-sparse covariates, Lounici et al. [18] show oracle inequalities for the estimation error (in terms of the mixed \((2,p)\)-norm) and the prediction error (for fixed design). Blazere et al. [5] study the properties of the group Lasso estimator in sparse high-dimensional generalized linear models (GLMs) with group sparsity of the covariates and derive oracle inequalities for the prediction and estimation error. Structured sparsity has recently attracted attention in high-dimensional data analysis; Zhou et al. [36] focus on oracle inequalities for GLMs with overlapping group structures. There have been considerable developments in oracle inequalities beyond linear models and GLMs. Lemler [17] introduces a data-driven weighted Lasso to estimate Cox models by approximating the intensity (without using the partial likelihood), and oracle inequalities in terms of an appropriate empirical Kullback–Leibler divergence are obtained. Focusing on misspecified Cox models and their partial likelihood, Kong and Nan [16] derive nonasymptotic oracle inequalities for the weighted Lasso penalized negative log partial likelihood. Similar results have been proposed for Cox models with time-dependent covariates, see Huang et al. [13], who use a martingale analysis of the KKT conditions. Honda and Hardle [11] consider group SCAD-type and adaptive group Lasso estimators for variable selection in Cox models with varying coefficients, and the \(L_{2}\) convergence rate is obtained in the increasing-dimension setting \(p/n \to 0\).

Contributions:

  • The existing work on weighted group Lasso penalized Cox models has paid little attention to theoretical results. Yan and Huang [28] propose a weighted group Lasso method that selects important time-dependent variables with a group structure. We propose oracle inequalities for the prediction and estimation error under random design, which differs from Huang et al. [13] and Kong and Nan [16] (they consider neither random design nor the prediction error).

  • Huang et al. [13] do not give a clear definition of the true coefficient; our true coefficient in the oracle inequalities is defined as the minimizer of the expected loss function, which makes it applicable to misspecified Cox models.

  • We provide unified nonasymptotic results in terms of oracle inequalities for the prediction and estimation error, which gives a theoretical justification for the consistency of the weighted group Lasso estimator in Cox models (time-dependent covariates and random design).

The paper is organized as follows. Section 2 gives a brief review of Cox models. Section 3 presents the weighted group Lasso penalty for misspecified Cox models. Section 4 establishes the oracle inequalities for prediction and estimation for the weighted group Lasso penalized partial likelihood in misspecified Cox models, while detailed proofs are given in Sect. 5.

2 A brief review of Cox models

The celebrated Cox models have provided a tremendously successful tool for exploring the association of covariates with failure time and survival distributions. To match the drop-out situation in clinical trials, we assume that the continuous survival time \(T_{i}^{*}\) is subject to random right censoring. For subject i, let \(T_{i}: = {T_{i}}^{*} \wedge {C_{i}}\) be the observed survival time, which is right-censored by \({C_{i}}\), and let the censoring indicator be \({\Delta _{i}} = 1({T_{i}}^{*} \le {C_{i}})\). Let \(\{z_{i}(t)\}_{i=1}^{n}\) be the p-dimensional time-dependent covariates, where \({z_{i}}(t): = ({z_{i1}}(t), \ldots ,{z_{ip}}(t))^{\tau }\). Here we assume that the censoring is noninformative. The time-dependent covariates may degenerate to time-independent covariates, i.e., \({z_{ik}}(t)\equiv {z_{ik}}\) for some index k. For example, the CD4 count (related to a longitudinal process) is time-dependent, whereas time-independent covariates are baseline (fixed) covariates, which include age, sex, treatment indicator, and so on.

Suppose that we observe n independent and identically distributed (i.i.d.) data

$$ \bigl\{ {T_{i}},{\Delta _{i}},{\bigl\{ {z_{i}}(t)\bigr\} _{0 \le t \le \tau }}\bigr\} _{i = 1}^{n}, $$
(2.1)

which are sampled from the random population \((T,\Delta ,{\{ z(t)\} _{0 \le t \le \tau }})\).

Let \(S(t|{\mathcal{Z}}) = P ( {T > t|{\mathcal{Z}}} )\) be the conditional survival function, where \({\mathcal{Z}}\) is the sigma algebra generated by some covariate variables. The conditional distribution function is related to \(S(t|{\mathcal{Z}})\) by \(F(t|{\mathcal{Z}}) = P ( {T \le t|{\mathcal{Z}}} ) = 1 - S(t|{\mathcal{Z}})\). Denote by \(f(t|{\mathcal{Z}}) = \frac{d}{d t} F(t|{\mathcal{Z}})\) the conditional probability density function. Different from the linear model, which models the conditional mean, or quantile regression, which models conditional quantiles, the Cox models (also called proportional hazards regression or Cox regression) aim to model the conditional hazard rate defined by

$$ h (t|{\mathcal{Z}}) := \lim_{\delta \to {0 }} \frac{{P ( {t \le {T} < t + \delta |{T} \ge t,{\mathcal{Z}}} )}}{\delta }= \frac{f(t|{\mathcal{Z}})}{S(t|{\mathcal{Z}})} = - \frac{{\partial \log S(t|{\mathcal{Z}})}}{{\partial t}}. $$
(2.2)

The \(h (t|{\mathcal{Z}})\) is the conditional hazard rate at time t given survival until time t or later (i.e., \(T \ge t\)). From (2.2), \(S(t|{\mathcal{Z}})\) can be represented as the negative exponential of the cumulative hazard function \(H(t|{\mathcal{Z}}) = \int _{0}^{t} h(s|{\mathcal{Z}})\,ds\), i.e., \(S(t|{\mathcal{Z}}) = \exp \{ { - \int _{0}^{t} h(s|{\mathcal{Z}})\,\mathrm{d}s} \} \equiv {e^{ -H(t|{\mathcal{Z}})}}\).

Having obtained the covariates \(\{z_{i}(t)\}_{i=1}^{n}\), our aim is to model the conditional hazard function of survival time \(\{T_{i}\}_{i=1}^{n} \) in a finite time interval \([0,\tau ]\) by the following semi-parametric regressions:

$$ h_{i} ( t ):=h ( t|{z_{i}} ) = {h_{0}}(t)\exp \bigl\{ z_{i}^{\tau }(t){\beta ^{*}}\bigr\} \quad \text{for } 0\leq t \leq \tau < \infty , $$
(2.3)

where \(h_{0}(t)\) is an unknown baseline hazard function, and \(\beta ^{*} \in \mathbb{R}^{p}\) is an unknown parameter which needs to be estimated.
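
To make the observed data (2.1) and model (2.3) concrete, the following Python sketch simulates right-censored survival data from a Cox model. The constant baseline hazard, the exponential censoring distribution, the time-independent covariates, and all numerical values are illustrative assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5                                     # sample size and covariate dimension
beta_star = np.array([1.0, -0.5, 0.0, 0.0, 0.8])  # sparse "true" coefficient (assumed)
h0 = 0.1                                          # constant baseline hazard (assumption)

z = rng.normal(size=(n, p))                       # time-independent covariates z_i
# Under h(t|z_i) = h0 * exp(z_i' beta*), the latent survival time T_i^* is
# exponential with rate h0 * exp(z_i' beta*).
rate = h0 * np.exp(z @ beta_star)
T_star = rng.exponential(1.0 / rate)              # latent survival times T_i^*
C = rng.exponential(10.0, size=n)                 # censoring times C_i (assumption)
T = np.minimum(T_star, C)                         # observed times T_i = T_i^* ∧ C_i
Delta = (T_star <= C).astype(int)                 # censoring indicators 1(T_i^* <= C_i)
```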

By profiling out the term \(h_{0}(t)\), Cox [6] suggests that the inference on \(\beta ^{*}\) be based on the random likelihood function

$$ L_{n}(\beta ;T,z,\Delta )=\prod _{i=1}^{n} \biggl\{ \frac{e^{ {z_{i}^{\tau }(T_{i})}\beta }}{\sum_{j \in R_{i}} e^{ {z_{j}^{\tau }(T_{i})}\beta }} \biggr\} ^{\Delta _{i}}, $$
(2.4)

where \(R_{i}= \{ j : T_{j} \geq T_{i} \} \) is the risk set (the set of individuals whose survival times are at least \(T_{i}\)). In a later paper, Cox [7] rigorously derives the so-called partial likelihood function.
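
The following is a minimal sketch of the scaled negative log partial likelihood \(-\frac{1}{n}\log L_{n}(\beta ;T,z,\Delta )\) built from (2.4), assuming time-independent covariates and no ties; it reuses the arrays T, Delta, z from the simulation sketch above.

```python
def neg_log_partial_likelihood(beta, T, Delta, z):
    """-(1/n) * log of the partial likelihood (2.4); time-independent covariates, no ties."""
    n = len(T)
    eta = z @ beta                                # linear predictors z_i' beta
    total = 0.0
    for i in range(n):
        if Delta[i] == 1:
            at_risk = T >= T[i]                   # risk set R_i = {j : T_j >= T_i}
            total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -total / n
```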

Suppose that the observed time is a continuous variable and that there are no ties in the observation times. The joint likelihood for the i.i.d. data (2.1) can be written as follows:

$$\begin{aligned} {L_{n}}(\beta ,z,\Delta ) &= \prod _{i:{\Delta _{i}} = 1} f ( {{T_{i}}|{z_{i}}} )\prod _{i:{\Delta _{i}} = 0} { \bigl( {1 - F ( {{T_{i}}|{z_{i}}} )} \bigr)} \\ &=\prod_{i = 1}^{n} {{{\bigl[f ( {{T_{i}}|{z_{i}}} )\bigr]}^{{\Delta _{i}}}}\bigl[} S ( {{T_{i}}|{z_{i}}} ){\bigr]^{1 - {\Delta _{i}}}}= \prod _{i = 1}^{n} {{{\bigl[h ( {{T_{i}}|{z_{i}}} )\bigr]}^{{\Delta _{i}}}}} S ( {{T_{i}}|{z_{i}}} ) \\ & = \prod_{i = 1}^{n} {{{ \bigl\{ {{e^{z_{i}^{\tau }({T_{i}}) \beta }} {h_{0}} ( {{T_{i}}} )} \bigr\} }^{{\Delta _{i}}}}} \exp \biggl\{ { - \int _{0}^{{T_{i}}} {{h_{0}}(s){e^{z_{i}^{\tau }(s) \beta }}} \,\mathrm{d}s} \biggr\} \\ & = \exp \Biggl\{ {\sum_{i = 1}^{n} { \biggl[ {{\Delta _{i}} \bigl\{ {z_{i}^{\tau }({T_{i}}) \beta + \log {h_{0}} ( {{T_{i}}} )} \bigr\} - \int _{0}^{{T_{i}}} {{h_{0}}(s){e^{z_{i}^{\tau }(s) \beta }}} \,\mathrm{d}s} \biggr]} } \Biggr\} , \end{aligned}$$
(2.5)

which contains the unknown \({h_{0}}(\cdot )\).

The key to deriving (2.4) is to specify a reasonable estimator \(\hat{h}_{0}(\cdot )\) for \(h_{0}(\cdot )\) in (2.3). Assume that \(h_{0}(\cdot )\) is discrete with mass \({{h_{0}}} ({T_{(1)}}), \ldots ,{{h_{0}}} ({T_{(k)}})\) at the ordered observed survival times \(T_{(1)}< \cdots < T_{(k)}\). Denote by \(\{z_{(o)}{({T_{(o)}})}:o=1, \ldots , k\}\) the k covariates corresponding to the ordered observed survival times \({T_{(o)}}\). The baseline cumulative hazard function \(H_{0}(t)\) is modeled non-parametrically as the step function \({H_{0}}(t) = \sum_{o = 1}^{k} {{h_{0}}} ({T_{(o)}})I({T_{(o)}} \le t)\), and hence \(\sum_{i = 1}^{n}\int _{0}^{{T_{i}}} {{h_{0}}(s){e^{z_{i}^{\tau }(s)\beta }}} \,\mathrm{d}s=\sum_{i = 1}^{n} {\sum_{o = 1}^{k} {{h_{0}}} ({T_{(o)}})I({T_{(o)}} \le T_{i}){e^{z_{i}^{\tau }({T_{(o)}}) \beta }}} \).

From (2.5), the joint log-likelihood function is expressed as follows:

$$\begin{aligned} &\log {L_{n}}(\beta ;T,z,\Delta ) \\ &\quad = \sum _{o = 1}^{k} { \bigl\{ {z_{({\mathrm{{o}}})}^{\tau }({T_{(o)}}) \beta + \log {h_{0}} ( {T_{(o)}} )} \bigr\} } -\sum _{o = 1}^{k} {\sum_{i = 1}^{n} {I({T_{(o)}} \le T_{i}){h_{0}}} ({T_{(o)}}){e^{z_{i}^{\tau }({T_{(o)}}) \beta }}} \\ &\quad =\sum_{o = 1}^{k} { \bigl\{ {z_{(o)}^{\tau }({T_{(o)}}) \beta + \log {h_{0}} ( {T_{(o)}} )} \bigr\} } - \sum _{o = 1}^{k} {\sum_{ \{ {j:{T_{j}} \ge {T_{(o)}}} \} } {{h_{0}}({T_{(o)}}){e^{z_{j}^{\tau }({T_{(o)}})\beta }}}, } \end{aligned}$$
(2.6)

where \({ \{ {j:{T_{j}} \ge {T_{(o)}}} \} }\) denotes the set of individual js who are “at risk” for failure at time \({T_{(o)}}\).

Taking the derivative of \(\log {L_{n}}(\beta ;T,z,\Delta )\) with respect to \({h_{0}}({T_{(o)}}), o=1, \ldots , k\), and setting it to zero, we get

$$ {{\hat{h}}_{0}} ( {T_{(o)}} ) = {\biggl[\sum _{ \{ {j:{T_{j}} \ge {T_{(o)}}} \} } {{e^{z_{j}^{\tau }({T_{(o)}})\beta }}} \biggr]^{- 1}}, $$

which is also called Breslow’s estimator for the baseline hazard function.
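
A short sketch of this profile (Breslow-type) estimate evaluated at a given β, again assuming time-independent covariates; the function name and its arguments are our own choices for illustration.

```python
def breslow_baseline_hazard(beta, T, Delta, z):
    """h0_hat(T_(o)) = 1 / sum_{j: T_j >= T_(o)} exp(z_j' beta) at each ordered failure time."""
    risk = np.exp(z @ beta)
    event_times = np.sort(T[Delta == 1])          # ordered uncensored times T_(1) < ... < T_(k)
    h0_hat = np.array([1.0 / risk[T >= t].sum() for t in event_times])
    return event_times, h0_hat
```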

Plugging \({{\hat{h}}_{0}} ( {T_{(o)}} )\) into (2.6), we have

$$\begin{aligned} \log {L_{n}}(\beta ;T,z,\Delta )& \propto \sum _{o = 1}^{k} {\biggl[ {z_{(o)}^{\tau }({T_{(o)}}) \beta - \log \sum_{ \{ {j:{T_{j}} \ge {T_{(o)}}} \} } {{e^{z_{j}^{\tau }({T_{(o)}})\beta }}} } \biggr]} \\ & \propto \sum_{i = 1}^{n} { \Biggl\{ {z_{i}^{\tau }({T_{i}}) \beta - \log \Biggl[ {\sum _{j = 1}^{n} 1 ( {{T_{j}} \ge {T_{i}}} )\exp \bigl\{ {z_{j}^{\tau }({T_{i}}) \beta } \bigr\} } \Biggr]} \Biggr\} } {\Delta _{i}}, \end{aligned}$$

which gives (2.4).

Following the counting process framework in Andersen and Gill [2], let \(N_{i}(t)=1 (T_{i} \leq t, \Delta _{i}=1 )\) be the counting process, and denote \(Y_{i}(t)=: 1 (T_{i} \geq t )\) to be the at-risk process for subject i. The σ-filtration is defined by \({{\mathcal{F}}_{t}} = \sigma \{ {N_{i}}(s),{Y_{i}}(s),{z_{i}}(s),s \le t,i = 1, \ldots ,n\}\), which represents the information that occurs up to time t. Let \(\mathrm{d} N_{i}(s):=1 \{T_{i} \in [s, s+\mathrm{d}s], \Delta _{i}=1 \}\). The negative log-partial-likelihood (2.4) for data (2.1) is rewritten as follows:

$$\begin{aligned} &\ell _{n}(\beta ;T,z,\Delta ) \\ &\quad :=-\frac{1}{n} \sum _{i=1}^{n} \Biggl\{ {z_{i}^{\tau }(T_{i})} \beta -\log \Biggl[ \sum_{j=1}^{n} 1 (T_{j} \geq T_{i} ) \exp \bigl\{ {z_{j}^{\tau }(T_{i})} \beta \bigr\} \Biggr] \Biggr\} \Delta _{i} \\ &\quad \propto -\frac{1}{n} \Biggl(\sum_{i=1}^{n} \int _{0}^{t} {z_{i}^{\tau }(u)} \beta \,\mathrm{d} N_{i}(u)- \int _{0}^{t} \log \Biggl[ \frac{1}{n}\sum _{j=1}^{n} 1 (T_{j} \geq u ) \exp \bigl\{ {z_{j}^{\tau }(u)}\beta \bigr\} \Biggr] \, \mathrm{d} \overline{N}(u) \Biggr) \\ &\quad =-\frac{1}{n}\sum_{i=1}^{n} \int _{0}^{t}\bigl[{z_{i}^{\tau }(u)} \beta - \log R_{n}(u, \beta )\bigr]\,\mathrm{d} N_{i}(u), \end{aligned}$$
(2.7)

where \(R_{n}(u, \beta )=\frac{1}{n}\sum_{j=1}^{n}1 (T_{j} \geq u ) \exp \{ {z_{j}^{\tau }(u)}\beta \} \) is the empirical relative risk function.

The negative log-partial likelihood function (2.7), as the summands are neither independent nor Lipschitz, can be approximated by the following intermediate empirical loss function:

$$\begin{aligned} \tilde{\ell }_{n}(\beta ;T,z,\Delta )&=- \frac{1}{n} \sum_{i=1}^{n} \bigl\{ {z_{i}^{\tau }(T_{i})}\beta -\log R(T_{i}, \beta ) \bigr\} \Delta _{i} \\ &={\ell _{n}}(\beta ;T,z,\Delta ) + \frac{1}{n}\sum _{i = 1}^{n} { \biggl\{ {\log \frac{{{R_{n}}({T_{i}},\beta )}}{{R({T_{i}},\beta )}}} \biggr\} } {\Delta _{i}} \end{aligned}$$
(2.8)

with expected relative risk function defined by \(R(t, \beta )={\mathrm{{E}}}[1(T \geq t) \exp \{ {z^{\tau }(t)}\beta \} ]\).

We define the loss function by \(l(\beta ;T,z,\Delta ): = - [{z^{\tau }}(T)\beta - \log R(T,\beta )] \Delta \).

Let \(\overline{N}(t):=\sum_{i=1}^{n} N_{i}(t)\). The gradient of \({\ell _{n}}(\beta ;T,z,\Delta )\) can be written as

$$\begin{aligned} \nabla \ell _{n}(\beta ;T,z,\Delta ):= \frac{\partial \ell _{n}(\beta ;T,z,\Delta )}{\partial {\beta }}=- \frac{1}{n} \sum_{i=1}^{n} \int _{0}^{t}\bigl[z_{i}(u)- \overline{z}_{n}(u, \beta )\bigr] \,\mathrm{d} N_{i}(u), \end{aligned}$$
(2.9)

where \({{\bar{z}}_{n}}(u,\beta ) = \frac{1}{n}\sum_{j = 1}^{n} {\frac{{{Y_{j}}(u){{\mathrm{{e}}}^{z_{j}^{\tau }(u){\beta }}}}}{{{R_{n}} ( {u,{\beta }} )}}} {z_{j}}(u)\) is the random weighted sum of covariates.

The gradient \(\nabla \ell _{n}(\beta ;T,z,\Delta )\) is called the score process, which, evaluated at the true parameter of a correctly specified model, is a martingale adapted to the filtration \(\mathcal{F}_{t}\). Furthermore, the Hessian matrix of \(\ell _{n}(\beta ;T,z,\Delta )\) is

$$ \nabla ^{2}\ell _{n}(\beta ;T,z,\Delta )=\frac{1}{n} \int _{0}^{t}{V_{n}}(u, \beta ) \, \mathrm{d} \overline{N}(u), $$

where \({V_{n}}(u, \beta ) = \frac{1}{n}\sum_{i = 1}^{n} {\frac{{{Y_{i}}(u){{\mathrm{{e}}}^{z_{i}^{\tau }(u){\beta }}}}}{{{R_{n}} ( {u,{\beta }} )}}} [z_{i}(u)-\overline{z}_{n}(u,\beta )][z_{i}(u)-\overline{z}_{n}(u, \beta )]^{\tau }\) is the random weighted sample covariance matrix. Readers can refer to Andersen et al. [1] for the technical details required to make the counting process framework rigorous.
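
For time-independent covariates, the score (2.9) reduces at each uncensored time \(T_{i}\) to the covariate \(z_{i}\) minus the risk-set weighted average \(\bar{z}_{n}(T_{i},\beta )\). A minimal sketch under that assumption, reusing T, Delta, z from the sketches above:

```python
def score(beta, T, Delta, z):
    """Gradient (2.9): -(1/n) * sum_i Delta_i * [z_i - zbar_n(T_i, beta)]."""
    n, p = z.shape
    w = np.exp(z @ beta)                          # relative risks exp(z_j' beta)
    grad = np.zeros(p)
    for i in range(n):
        if Delta[i] == 1:
            at_risk = T >= T[i]
            zbar = (w[at_risk, None] * z[at_risk]).sum(axis=0) / w[at_risk].sum()
            grad += z[i] - zbar
    return -grad / n
```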

3 Weighted group lasso for misspecified Cox models

In this section, we present the concepts and mathematical notation for the penalized misspecified Cox models with group structure.

Many high-dimensional variables in microarray data and other scientific applications have a natural group structure. It is often better to divide the p variables into small sets of variables based on biological knowledge, see Kanehisa and Goto [14], Wang et al. [27]. Suppose that the p-dimensional covariate X is divided into \(G_{n}\) groups, each of size \(d_{g}\) for \(g \in \lbrace 1,\ldots,G_{n} \rbrace \),

$$ X_{i}=\bigl(X^{1}_{i},\ldots,X^{g}_{i}, \ldots,X^{G_{n}}_{i}\bigr)^{\tau },\quad i=1,\ldots,n, $$

where \(X^{g}_{i}=(X_{i,1}^{g},\ldots,X_{i,d_{g}}^{g})^{T}\) and \(\sum_{g=1}^{G_{n}}d_{g}=p\).

It is allowed that the number of groups increases with the sample size n and \(G_{n}\gg n\). We define the two quantities

$$ d_{\mathrm{max}}:= \max_{g \in \lbrace 1,\ldots,G_{n} \rbrace }d_{g} \quad \text{and}\quad d_{\mathrm{min}}:= \min_{g \in \lbrace 1,\ldots,G_{n} \rbrace }d_{g}, $$

which are crucial constants in the theoretical analysis.

For \(\beta \in \mathbb{R}^{p}\), let \(\beta ^{g}\) be the sub-vector of β whose indexes correspond to the index set of the gth group of X. Given a proper tuning parameter λ, we are interested in weighted group Lasso estimator which achieves group sparsity. It is obtained as the solution of the convex optimization problem:

$$ \hat{\beta }_{n}=\mathop{\mathrm{argmin}}\limits _{\beta \in \mathbb{R}^{p}} \Biggl\{ \ell _{n}(\beta ;T,z,\Delta )+{\lambda } {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{g}} \bigr\Vert _{2}}}\Biggr\} , $$
(3.1)

where \(\Vert \cdot \Vert _{2}\) denotes the Euclidean norm and \({w_{g}}\) is a given weight.

If all \(d_{g}\) are of size one and \({w_{g}}=1\), then \(\sum_{g=1}^{G_{n}}{w_{g}}\| \beta ^{g}\|_{2}\) reduces to \(\| \beta \|_{1}\), which is essentially a Lasso problem. If all \(d_{g}\) are of size one and \(\{ {w_{j}}\} _{j = 1}^{p}\) are data-dependent weights (weights that only depend on the observed data), let \(W = \operatorname{diag}\{ {w_{1}}, \ldots ,{w_{p}}\} \); then the weighted group Lasso penalty \(\sum_{g=1}^{G_{n}}{w_{g}}\| \beta ^{g}\|_{2}\) becomes the weighted Lasso penalty \(\| {W\beta } \|_{1}\). Increasing λ leads to stronger shrinkage of the \(\beta ^{g}\) toward zero, which means that some blocks of β vanish simultaneously and groups of predictors are eliminated from the model. In the literature one typically chooses \({w_{g}}:=\sqrt{ d_{g}}\) to penalize groups of large size more heavily. For the adaptive group Lasso in Cox models, Yan and Huang [28] use \({w_{g}}=\sqrt{d_{g}} /\|\tilde{\beta }^{g}\|\), where \(d_{g}\) is the size of group g and \(\tilde{\beta }^{g}\) is some consistent estimator of \(\beta ^{g}\).
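
Because the penalty in (3.1) is separable across groups and the proximal operator of \(\eta \lambda w_{g}\|\cdot \|_{2}\) (with step size η) is blockwise soft-thresholding, the estimator can be approximated, for instance, by proximal gradient descent. The sketch below reuses the score function from Sect. 2; the fixed step size, the iteration count, the choice \(w_{g}=\sqrt{d_{g}}\), and the example group partition are illustrative assumptions, not a tuned implementation.

```python
def group_soft_threshold(v, threshold):
    """Proximal operator of threshold * ||.||_2: shrink the whole block toward zero."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v

def weighted_group_lasso_cox(T, Delta, z, groups, lam, step=0.5, n_iter=500):
    """Proximal gradient sketch for (3.1) with weights w_g = sqrt(d_g)."""
    beta = np.zeros(z.shape[1])
    weights = [np.sqrt(len(g)) for g in groups]
    for _ in range(n_iter):
        grad = score(beta, T, Delta, z)           # gradient of the smooth part ell_n
        beta_tmp = beta - step * grad             # gradient step
        for g, w_g in zip(groups, weights):       # blockwise soft-thresholding
            beta_tmp[g] = group_soft_threshold(beta_tmp[g], step * lam * w_g)
        beta = beta_tmp
    return beta

# usage: the groups partition {0, ..., p-1}, e.g.
# beta_hat = weighted_group_lasso_cox(T, Delta, z, [[0, 1], [2, 3], [4]], lam=0.05)
```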

Taking the subdifferential of the objective function (3.1), we get the first order condition:

$$ \textstyle\begin{cases} \frac{\partial \ell _{n}(\beta ;T,z,\Delta )}{\partial {{\beta }}_{g}}|_{{{\beta }}_{g}=\hat{{\beta }}_{g}}=-\lambda w_{g} \frac{\hat{{\beta }}_{g}}{ \Vert \hat{{\beta }}_{g} \Vert _{2}} & \text{if } \hat{{\beta }}_{g} \neq {0,} \\ \Vert \frac{\partial \ell _{n}(\beta ;T,z,\Delta )}{\partial {{\beta }}_{g}}|_{{{\beta }}_{g}=\hat{{\beta }}_{g}} \Vert _{2} \leq \lambda w_{g}& \text{if } \hat{{\beta }}_{g}={0}. \end{cases} $$
(3.2)

(This is also called the Karush–Kuhn–Tucker (KKT) condition, see Sect. 2.2 of Huang et al. [13] for the ungrouped version.) From the adaptive estimation point of view, the weights in equation (3.1) can be determined from the observed data such that the KKT conditions (3.2) hold with high probability, for example, \(1-p^{r}, r<0\). Applying concentration inequalities for martingales, the data-driven weights \(\{ {w_{j}}\} _{j = 1}^{p}\) are obtained from the KKT conditions with high probability, see Huang et al. [12] and the references therein. The aim of this work is to derive nonasymptotic oracle inequalities from a mathematical point of view. The choice of optimal adaptive weights and statistical inference (confidence intervals, testing the coefficients, FDR control) is left for future studies.
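
As a numerical sanity check of (3.2), one can verify group by group that an approximate solution satisfies the first-order condition: for an active group the gradient block should be close to \(-\lambda w_{g}\hat{\beta }_{g}/\|\hat{\beta }_{g}\|_{2}\), while for a zero group its Euclidean norm should not exceed \(\lambda w_{g}\). A rough sketch using the hypothetical helpers defined above (the tolerance is an arbitrary choice):

```python
def check_kkt(beta_hat, T, Delta, z, groups, lam, tol=1e-4):
    """Check the first-order condition (3.2) group by group."""
    grad = score(beta_hat, T, Delta, z)
    ok = True
    for g in groups:
        w_g = np.sqrt(len(g))                     # weights w_g = sqrt(d_g), as above
        b_g, grad_g = beta_hat[g], grad[g]
        if np.linalg.norm(b_g) > 0:
            # active group: grad_g should be close to -lam * w_g * b_g / ||b_g||_2
            ok &= np.linalg.norm(grad_g + lam * w_g * b_g / np.linalg.norm(b_g)) < tol
        else:
            # inactive group: ||grad_g||_2 should not exceed lam * w_g
            ok &= np.linalg.norm(grad_g) <= lam * w_g + tol
    return bool(ok)
```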

In the high-dimensional setting, we study oracle inequalities for the estimation and prediction error of the weighted group Lasso even when the number of groups is much larger than the sample size, i.e., \(G_{n}\gg n\). Define \(H^{*}= \lbrace g: \beta _{g}^{*}\neq 0 \rbrace \) as the group index set corresponding to the nonzero sub-vectors of \(\beta ^{*}\).

Let \(X_{1}, \ldots , X_{n}\) be a random sample from a measure \(\mathbb{P}\) on a measurable space \((\mathcal{X},\mathcal{A})\). We denote the empirical distribution as a discrete uniform measure \(\mathbb{P}_{n}=n^{-1} \sum_{i=1}^{n} \delta _{X_{i}}\), where \(\delta _{x}\) is the probability distribution that is degenerate at x.

The expected loss function is defined by

$$ \ell (\beta ;T,z,\Delta ) = - {\mathrm{{E}}} \bigl[ { \bigl\{ {{z^{\tau }}(T) \beta - \log R(T,\beta )} \bigr\} \Delta } \bigr] =: { \mathrm{{E}}}l( \beta ;T,z,\Delta ). $$

Corresponding to the form of estimator, the true parameter of the misspecified Cox models is the minimizer of the expected loss function

$$ {\beta ^{*}}=\mathop{\mathrm{{argmin}}}\limits _{{\beta } \in {{\mathbb{R}}^{p}}} \mathbb{P}l(\beta ;T,z,\Delta )=\mathop{\mathrm{{argmin}}} \limits _{{\beta } \in {{\mathbb{R}}^{p}}}-{ \mathrm{{E}}} \bigl\{ { \bigl[ {{z^{\tau }}(T)\beta - \log R(T,\beta )} \bigr]\Delta } \bigr\} , $$
(3.3)

where \(R(t, \beta )={\mathrm{{E}}}[1(T \geq t) \exp \{ {z^{\tau }(t)}\beta \} ]\).

Definition (3.3) was first studied by Struthers and Kalbfleisch [21], who characterize the true parameter as the solution of an estimating equation, which is also briefly mentioned in the proof of Lemma 3.1 in Andersen and Gill [2].

Here the expectation of the random variables in the model is unknown, and so is \({\beta ^{*}}\). By solving the optimization problem in (3.3), \({\beta ^{*}}\) satisfies

$$ {\beta ^{*}} = \biggl\{ {{{\beta } \in {{ \mathbb{R}}^{p}}} : \mathbb{P}\dot{l}(\beta ;T,z,\Delta )=-{ \mathrm{{E}}} \biggl[ { \biggl\{ {z(T) - \frac{{{\mathrm{{E}}}[Y(t){z}(T){\mathrm{e} ^{{z^{\tau }}(T)\beta }}]}}{{{\mathrm{{E}}}[Y(t){\mathrm{e} ^{{z^{\tau }}(T)\beta }}]}}} \biggr\} \Delta } \biggr]= 0} \biggr\} . $$
(3.4)

In order to get a unique solution in (3.4), we require that the Hessian matrix of the expected loss function

$$ \begin{aligned}[b] {\mathrm{{E}}}\ddot{l}(\beta ;T,z,\Delta )&={\mathrm{{E}}} \biggl[ \biggl\{ \frac{{{\mathrm{{E}}}[Y(t)z(T){z^{\tau }}(T){{\mathrm{{e}}}^{{z^{\tau }}(T)\beta }}]}}{{{\mathrm{{E}}}[Y(t){{\mathrm{{e}}}^{{z^{\tau }}(T)\beta }}]}} \\ &\quad {}- \frac{{{\mathrm{{E}}}[Y(t)z(T){{\mathrm{{e}}}^{{z^{\tau }}(T)\beta }}]{\mathrm{{E}}}[Y(t){z^{\tau }}(T){{\mathrm{{e}}}^{{z^{\tau }}(T)\beta }}]}}{{{{({\mathrm{{E}}}[Y(t){{\mathrm{{e}}}^{{z^{\tau }}(T)\beta }}])}^{2}}}} \biggr\} \Delta \biggr] \end{aligned} $$
(3.5)

is positive definite.

We aim to estimate sparse \(\beta ^{*}\) and to predict the hazard function \(h ( t|{z_{i}}(t) )\) conditionally on a given process \({z_{i}}(t)\). To facilitate the technical proof, additional assumptions are required.

  • (H.1): The covariates \(\{z_{i j}(t)\}\) are almost surely bounded by a positive constant L, i.e.,

    $$ \sup_{0 \le t \le \tau } \max_{1 \le i \le n,1 \le j \le p } \bigl\vert z_{i j}(t) \bigr\vert \le L,\quad \mbox{a.s.} $$
  • (H.2): Assume that the parameter space is compact, \(\|\beta ^{*} \|_{1} \le B\), where B is a positive constant.

  • (H.3): There exists a large constant M such that β̂ is in the weighted \(\ell _{2}\)-ball

    $$ {{\mathcal{S}}_{M}}\bigl(\beta ^{*}\bigr):= \Biggl\{ { \beta \in {\mathbb{R}^{p}}:{\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{g}}- \beta ^{*g} \bigr\Vert _{2}}} \le {M}} \Biggr\} . $$
  • (H.4): Under \(\Delta =1\), there exist constants \({c_{l}}> 0\) and \({c_{u}}<\infty \) such that \(\ddot{l}(\beta ;t,z,\Delta )\) is uniformly positive definite for all \({\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})}\):

    $$ {c_{u}} {z(t)} {z^{{\tau }}(t)} \succ {\mathrm{{E}}}\bigl[ \ddot{l}(\beta ;T,z, \Delta )|z(t)\bigr]\succ {c_{l}} {z(t)} {z^{{\tau }}(t)}\quad \mbox{a.s.} $$

(H.1) and (H.2) are standard assumptions for deriving consistency properties of regularized GLMs, see Blazere et al. [5], Zhang and Wu [33]. (H.2) is also used in Zhao et al. [35] for increasing-dimensional Cox models with interval-censored data. (H.3) has been addressed by Kong and Nan [16]. (H.4) ensures that the objective function defining the minimizer of the population expected loss is strongly convex; a similar assumption is used in Andersen and Gill [2], Fan and Li [9].

As mentioned by one reviewer, one often assumes that the data are generated from the model with some baseline hazard function and some true parameter \(\beta ^{*}\). In (3.3), the true parameter is instead defined as the minimizer of the expected loss function. We present the connection in detail, based on Theorem 1 in Struthers and Kalbfleisch [21].

Lemma 3.1

(Consistency)

Let the expectation E be taken with respect to randomness of \(\{ (T_{i}, \Delta _{i}, z_{i}(t) )\}_{i=1}^{n}\) from the true model. Consider the following notations for \(r=0,1,2\):

$$ \begin{gathered} S^{(r)}(t)=n^{-1} \sum _{i=1}^{n} Y_{i}(t)h_{0}(t)e^{{z_{i}^{\tau }(t) \beta ^{*}}} z_{i}(t)^{\otimes r}, \qquad s^{(r)}(t)={\mathrm{E }} \bigl[S^{(r)}(t)\bigr], \\ S^{(r)}(\beta , t)=n^{-1} \sum_{i=1}^{n} Y_{i}(t) e^{{z_{i}^{\tau }(t) \beta }} z_{i}(t)^{\otimes r}, \qquad s^{(r)}(\beta , t)={\mathrm{E }}\bigl[S^{(r)}( \beta , t) \bigr], \end{gathered} $$

where, for a column vector a, \(a^{\otimes 2}\) refers to the matrix \(a a^{T}\), \(a^{\otimes 1}\) refers to the vector a, and \(a^{\otimes 0}\) refers to the scalar 1. Consider the following conditions.

Condition 3.1

There exists a neighborhood \({\mathcal{S}}_{M}(\beta ^{*})\) of \(\beta ^{\ast }\) such that, for each \(t<\infty \),

$$ \sup_{x\in [ 0,t ] ,\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} \bigl\vert S^{ ( 0 ) } ( \beta ,x ) -s^{ ( 0 ) } ( \beta ;x ) \bigr\vert \rightarrow 0, \quad \text{in probability as } n\rightarrow \infty . $$

Condition 3.2

(a). The \(s^{ ( 0 ) } ( \beta ,x ) \) is bounded away from zero on \({\mathcal{S}}_{M}(\beta ^{*})\times [ 0,t ] \), and \(s^{ ( 0 ) } ( \beta ,x ) \) and \(s^{ ( 1 ) } ( \beta ,x ) \) are bounded on \({\mathcal{S}}_{M}(\beta ^{*})\times [ 0,t ] \); (b). For each \(t<\infty \), we have \(\int _{0}^{t}s^{ ( 2 ) } ( x ) \,dx< \infty \).

  • When the data are generated from the correctly specified Cox models (2.3), under Conditions 3.1 and 3.2, we have that the maximum partial likelihood estimator β̂ is a consistent estimator for \(\beta ^{\ast }\), where \(\beta ^{\ast }\) is the solution to the equation \(h ( \beta ) =0\) with

    $$ h ( \beta ) := \int _{0}^{\infty }s^{ ( 1 ) } ( t ) \,dt- \int _{0}^{\infty } \frac{s^{ ( 1 ) } ( \beta ,t ) }{s^{ ( 0 ) } ( \beta ,t ) }s^{ ( 0 ) } ( t ) \,dt. $$
  • When the model is misspecified, i.e., suppose that the true hazard function is \(h_{i} ( t )\ne {h_{0}}(t)e^{ z_{i}^{\tau }(t){\beta ^{*}}}\). If \(S^{(r)}(t)\) and \(s^{(r)}(t)\) are replaced by \(S_{m}^{(r)}(t):=n^{-1} \sum_{i=1}^{n} Y_{i}(t)h_{i} ( t ) z_{i}(t)^{\otimes r}\) and \(s_{m}^{(r)}(t):={\mathrm{E }}[S_{m}^{(r)}(t)]\) in Conditions 3.1 and 3.2, then the solution of the equation \(h_{m} ( \beta ) =0\) with

    $$ h_{m} ( \beta ) := \int _{0}^{\infty }s_{m}^{ ( 1 ) } ( t ) \,dt- \int _{0}^{\infty } \frac{s^{ ( 1 ) } ( \beta ,t ) }{s^{ ( 0 ) } ( \beta ,t ) }s_{m}^{ ( 0 ) } ( t ) \,dt $$
    (3.6)

    is the pseudo-true parameter \(\beta ^{\ast }\).

Since \(d M_{i}(t):=d {N_{i}}(t)-{1 ( {{T_{i}} \ge t} )} {h_{0}}(t)e^{{z_{i}^{\tau }(t)}\beta ^{*}}\,dt\) is a mean-zero \({{\mathcal{F}}_{t}}\)-martingale increment by the theory in Andersen and Gill [2], comparing the empirical version (2.9) and the population version (3.4) with the limits \(s^{ ( 0 ) } ( t )\), \(s^{ ( 1 ) } ( t )\) and \({s^{ ( 0 ) } ( \beta ,t ) }\), \({s^{ ( 1 ) } ( \beta ,t ) }\), we see that (3.4) coincides with (3.6). Moreover, our assumptions (H.1)–(H.4) imply Conditions 3.1 and 3.2 without conflict when the uniform law of large numbers is applied, using the compactness of the parameter space and the boundedness of the covariates.

4 Oracle inequalities for estimation and prediction

As a powerful mathematical skill, oracle inequalities provide deep insight into the nonasymptotic fluctuation of an estimator compared to the unknown true parameter. A comprehensive theory of oracle inequalities in high-dimensional regressions has been developed for Lasso and its generalization, see Chap. 7 of Wainwright [26].

4.1 Key of nonasymptotic analysis

In this section, we establish nonasymptotic oracle inequalities for the weighted group Lasso estimates of Cox models, together with the required restricted-eigenvalue-type assumptions (such as the group stabil condition). The proof proceeds in several steps:

  • Step 1: To avoid ill behavior of the Hessian, propose a restricted eigenvalue condition or another analogous condition on the design matrix.

  • Step 2: Choose the tuning parameter based on a high-probability event (the KKT conditions or other KKT-like conditions).

  • Step 3: Using the restricted eigenvalue assumption and the selected tuning parameter, derive the oracle inequalities via the optimality of the weighted group Lasso estimator, the minimizer of the unknown expected risk function, and some basic inequalities. There are three sub-steps:

    • (i) Under the KKT-like conditions, show that the error vector \(\hat{\beta }- \beta ^{*}\) is in some restricted set with structure sparsity, and moreover check that \(\hat{\beta }- \beta ^{*}\) is in a big compact set;

    • (ii) Show that likelihood-based divergence of β̂ and \(\beta ^{*}\) can be lower bounded by some quadratic distance between β̂ and \(\beta ^{*}\);

    • (iii) By some elementary inequalities and (ii), show that \({\sum_{g = 1}^{G_{n}} {{w_{g}}\| {\hat{\beta }_{n}^{g}}- \beta ^{*g}\|_{2}}}\) is in a smaller compact set with radius of optimal rate (proportional to λ).

As mentioned by one reviewer, our general framework of the proof is quite standard, but the consecutive steps of defining some high-probability events rely on nontrivial new results. For simplicity, we introduce and use the notation of empirical processes, see van der Vaart and Wellner [25].

Recall that \(X_{1}, \ldots , X_{n}\) is a random sample from a measure \(\mathbb{P}\) on a measurable space \((\mathcal{X},\mathcal{A})\) and that the empirical distribution is the discrete uniform measure \(\mathbb{P}_{n}=n^{-1} \sum_{i=1}^{n} \delta _{X_{i}}\), where \(\delta _{x}\) is the probability distribution that degenerates at x.

Given a measurable function \(f : \mathcal{X} \mapsto \mathbb{R}\), we write \(\mathbb{P}_{n} f\) for the expectation of f under the empirical measure \(\mathbb{P}_{n}\), and Pf for the expectation under P. Thus

$$ \mathbb{P}_{n} f=\frac{1}{n} \sum_{i=1}^{n} f (X_{i} ), \qquad P f= \int f \,dP. $$

The quantity \(\mathbb{P}_{n} f\), viewed as a process indexed by the function f, is called an empirical process. In fact, we treat \(\mathbb{P}_{n}\) and P as operators rather than measures.

It follows from (2.8) and \(\mathbb{P}_{n} l(\beta ;T,z,\Delta ):=\tilde{\ell }_{n}(\beta ;T,z, \Delta )\) that

$$\begin{aligned} {\ell _{n}}(\beta ;T,z,\Delta )=\mathbb{P}_{n} l(\beta ;T,z,\Delta )- \frac{1}{n}\sum_{i = 1}^{n} { \biggl\{ {\log \frac{{{R_{n}}({T_{i}},\beta )}}{{R({T_{i}},\beta )}}} \biggr\} } {\Delta _{i}}. \end{aligned}$$
(4.1)

4.2 Define some events with high probability

Using the definition of \(\hat{\beta }_{n}\) in (3.1), we have

$$ \ell _{n}(\hat{\beta }_{n};T,z,\Delta )+\lambda {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\hat{\beta }_{n}^{g}} \bigr\Vert _{2}}} \le \ell _{n}\bigl(\beta ^{*};T,z, \Delta \bigr) +\lambda {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{*g}} \bigr\Vert _{2}}}. $$
(4.2)

Hence we get

$$\begin{aligned} &\mathbb{P}\bigl( l(\hat{\beta }_{n};T,z,\Delta )-l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr)+\lambda {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\hat{ \beta }_{n}^{g}} \bigr\Vert _{2}}} \\ &\quad \le \bigl[{\ell _{n}}\bigl(\beta ^{*};T,z,\Delta \bigr) - \mathbb{P}l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr] - \bigl[{ \ell _{n}}(\hat{\beta }_{n};T,z,\Delta ) - \mathbb{P}l( \hat{ \beta }_{n};T,z,\Delta )\bigr] \\ &\qquad {}+\lambda {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{*g}} \bigr\Vert _{2}}}. \end{aligned}$$
(4.3)

Then, by (4.1), the first and second terms on the right-hand side of (4.3) are

$$\begin{aligned}& \bigl[{\ell _{n}}\bigl(\beta ^{*};T,z,\Delta \bigr) - \mathbb{P} l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr]\\& \quad = \bigl[{ \mathbb{P}_{n}}l\bigl(\beta ^{*};T,z,\Delta \bigr) - \mathbb{P}l\bigl( \beta ^{*};T,z,\Delta \bigr)\bigr] - \frac{1}{n} \sum_{i = 1}^{n} { \biggl\{ {\log \frac{{{R_{n}}({T_{i}},{\beta ^{*}})}}{{R({T_{i}},{\beta ^{*}})}}} \biggr\} } {\Delta _{i}}, \\& \bigl[{\ell _{n}}(\hat{\beta }_{n};T,z,\Delta ) - \mathbb{P} l(\hat{\beta }_{n};T,z, \Delta )\bigr] \\& \quad = \bigl[{\mathbb{P}_{n}}l( \hat{\beta }_{n};T,z,\Delta ) - \mathbb{P}l(\hat{\beta }_{n};T,z,\Delta )\bigr] - \frac{1}{n}\sum _{i = 1}^{n} { \biggl\{ {\log \frac{{{R_{n}}({T_{i}},{{\hat{\beta }}_{n}})}}{{R({T_{i}},{{\hat{\beta }}_{n}})}}} \biggr\} } {\Delta _{i}}. \end{aligned}$$

This implies

$$\begin{aligned} &\bigl[{\ell _{n}}\bigl(\beta ^{*};T,z,\Delta \bigr) - \mathbb{P}l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr] - \bigl[{\ell _{n}}(\hat{\beta }_{n};T,z,\Delta ) - \mathbb{P}l( \hat{\beta }_{n};T,z,\Delta )\bigr] \\ &\quad =( \mathbb{P}_{n}-\mathbb{P}) \bigl( l\bigl(\beta ^{*};T,z,\Delta \bigr)-l( \hat{\beta }_{n};T,z,\Delta ) \bigr) -{D_{n}}\bigl(\hat{\beta },{\beta ^{*}}\bigr), \end{aligned}$$
(4.4)

where

$$ {D_{n} }\bigl(\hat{\beta },{\beta ^{*}}\bigr):=\Biggl[ \frac{1}{n}\sum_{i = 1}^{n} { \biggl\{ { \log \frac{{{R_{n}}({T_{i}},{\beta ^{*}})}}{{R({T_{i}},{\beta ^{*}})}}} \biggr\} } {\Delta _{i}} - \frac{1}{n} \sum_{i = 1}^{n} { \biggl\{ {\log \frac{{{R_{n}}({T_{i}},{{\hat{\beta }}_{n}})}}{{R({T_{i}},{{\hat{\beta }}_{n}})}}} \biggr\} } {\Delta _{i}}\Biggr]. $$

To obtain oracle inequalities for the weighted group Lasso applied to misspecified Cox models, it is necessary to study the rate of convergence of the empirical process \(( \mathbb{P}_{n}-\mathbb{P})( l(\beta ^{*};T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta ))\) and of \({D_{n}}(\hat{\beta },{\beta ^{*}})\). The centralized empirical loss \(( \mathbb{P}_{n}-\mathbb{P})( l(\beta ^{*};T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta ) )\) and the normalized error \({D_{n}}(\hat{\beta },{\beta ^{*}})\) represent the fluctuation between the expected loss and the sample loss. It will be shown that

$$ ( \mathbb{P}_{n}-\mathbb{P}) \bigl( l\bigl(\beta ^{*};T,z, \Delta \bigr)-l( \hat{\beta }_{n};T,z,\Delta )\bigr)\quad \text{and} \quad {D_{n}}\bigl(\hat{\beta },{\beta ^{*}}\bigr) $$

have stochastic Lipschitz properties with respect to \({\sum_{g = 1}^{G_{n}} {{w_{g}}\|{\hat{\beta }_{n}^{g}}-\beta ^{*g} \|_{2}}}\).

The concentration inequalities are essential tools to obtain an upper bound of (4.4), which is proportional to a regularization parameter that ensures good statistical properties of the regularized estimator with high probability.

Define \(F(s,z)\) as the joint distribution of \((T_{i},z_{i}^{\tau }(t))\). Let \(\tilde{\beta }: = {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) with the components \(\{ {{\tilde{\beta }}_{j}}\} _{j = 1}^{p}\) between \(\{ {{\hat{\beta }}_{j}}\} _{j = 1}^{p}\) and \(\{ \beta _{j}^{*}\} _{j = 1}^{p}\), respectively, via first-order Taylor’s expansions of the function

$$ {f_{t}}(\beta ) =\log R(t,{\beta })= \log {\mathrm{{E}}} \bigl[1(T_{i} \ge t){e^{{z_{i}^{\tau }}(t)\beta }}\bigr] = \log \int {1(s \ge t){e^{z_{i}^{\tau }(t)\beta }}\,dF(s,z)} $$

with derivative

$$ \frac{{d{f_{t}}(\beta )}}{{d{\beta _{j}}}} = \frac{{\int {z_{ij}^{\tau }(s)1(s \ge t){e^{z_{i}^{\tau }(t)\beta }}\,dF(s,z)} }}{{\int {1(s \ge t){e^{z_{i}^{\tau }(t)\beta }}\,dF(s,z)} }},\quad j = 1,2, \ldots ,p. $$

Plugging in \(t=T_{i}\), the mean value form of Taylor's expansion gives

$$ \log R\bigl(T_{i},{\beta ^{*}}\bigr) - \log R(T_{i},{{\hat{\beta }}_{n}}) = \sum_{j = 1}^{p} \frac{{\int {z_{ij}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}\bigl({ \beta ^{*}_{j}} - {{\hat{\beta }}_{j}}\bigr). $$

Considering the first term in (4.4), we have

$$\begin{aligned} &( \mathbb{P}_{n}-\mathbb{P}) \bigl( l\bigl(\beta ^{*};T,z, \Delta \bigr)-l( \hat{\beta }_{n};T,z,\Delta ) \bigr) \\ &\quad = - \frac{1}{n}\sum_{i = 1}^{n} {\bigl[z_{i}^{\tau }({T_{i}}){\beta ^{*}} - \log R\bigl({T_{i}},{\beta ^{*}}\bigr)\bigr]} {\Delta _{i}} + \frac{1}{n}\sum_{i = 1}^{n} {\bigl[z_{i}^{\tau }({T_{i}}){{\hat{\beta }}_{n}} - \log R({T_{i}},{{\hat{\beta }}_{n}}) \bigr]} {\Delta _{i}} \\ &\qquad{}- {\mathrm{{E}}}\bigl\{ {\bigl[z^{\tau }(T){\beta ^{*}} - \log R\bigl(T,{\beta ^{*}}\bigr)\bigr] \Delta }\bigr\} + { \mathrm{{E}}}\bigl\{ {\bigl[z^{\tau }(T){{\hat{\beta }}_{n}} - \log R(T,{{\hat{\beta }}_{n}})\bigr]\Delta }\bigr\} \\ &\quad = \frac{{ - 1}}{n}\sum_{i = 1}^{n} {\sum_{j = 1}^{p} {\bigl(\beta ^{*} - {{\hat{\beta }}}\bigr)} \bigl[{z_{ij}}({T_{i}}){\Delta _{i}} - {\mathrm{{E}}}\bigl({z_{ij}}({T_{i}}){ \Delta _{i}}\bigr)\bigr]} \\ &\qquad{}+ \frac{{ - 1}}{n}\sum_{i = 1}^{n} \sum_{j = 1}^{p} {\bigl(\beta _{j}^{*} - {{\hat{\beta }}_{j}}\bigr)} \biggl( \frac{{\int {z_{ij}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \frac{{\int {z_{ij}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \biggr) \\ &\quad = \sum_{g = 1}^{G_{n}} {\bigl(\beta _{g}^{*} - {{\hat{\beta }}_{g}}\bigr)} \frac{{ - 1}}{n}\sum_{i = 1}^{n} {\biggl[ \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} - {\mathrm{{E}}}\bigl(z_{ig}^{\tau }({T_{i}}){ \Delta _{i}}\bigr)\biggr]} \\ &\qquad{}+ \sum_{g = 1}^{G_{n}} {\bigl(\beta _{g}^{*} - {{\hat{\beta }}_{g}}\bigr)} \frac{{ - 1}}{n}\sum_{i = 1}^{n} \biggl( \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \biggr) \\ & \quad \le \sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert \beta _{g}^{*} - {\beta _{g}} \bigr\Vert _{2}} \cdot { \Biggl\Vert { \frac{1}{n}\sum_{i = 1}^{n} {\biggl[{ \Delta _{i}}\frac{{z_{ig}^{T}({T_{i}})}}{{{w_{g}}}} - {\mathrm{{E}}}\biggl( \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}\biggr)\biggr]} } \Biggr\Vert _{2}} \\ &\qquad{}+ \sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert \beta _{g}^{*} - {\beta _{g}} \bigr\Vert _{2}} \cdot \Biggl\Vert \frac{1}{n}\sum_{i = 1}^{n} \biggl( \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \biggr) \Biggr\Vert _{2}. \end{aligned}$$
(4.5)

To get the stochastic Lipschitz properties, we define the following two events:

$$\begin{aligned}& {{\mathcal{A}}_{1}} =\bigcap_{g = 1}^{{G_{n}}} \Biggl\{ {{{ \Biggl\Vert {\frac{1}{n}\sum_{i = 1}^{n} {\biggl[ \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} - {\mathrm{{E}}}\biggl( \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}\biggr)\biggr]} } \Biggr\Vert }_{2}} \le {\lambda _{a1}}} \Biggr\} , \\& \begin{aligned} {{\mathcal{A}}_{2}}& = \bigcap_{g = 1}^{{G_{n}}} \Biggl\{ \Biggl\Vert \frac{1}{n}\sum_{i = 1}^{n} \biggl( \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}\\ &\quad {}- {\mathrm{{E}}} \frac{{\int {z_{ig}(s)1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge T_{i}){e^{z_{i}^{\tau }(T_{i})\tilde{\beta }}}\,dF(s,z)} }} \biggr) \Biggr\Vert _{2}\le {\lambda _{a2}} \Biggr\} . \end{aligned} \end{aligned}$$

The random sum in event \({{\mathcal{A}}_{2}}\) does not consist of independent terms, which renders this problem more challenging. We need to check a uniform version of the event \({{\mathcal{A}}_{2}}\) in terms of β. Concentration inequalities for suprema of empirical processes are a powerful tool for checking that event \({{\mathcal{A}}_{2}}\) holds with high probability. The bound will be derived from Talagrand's sharper bounds for suprema of empirical processes, which generalize the Dvoretzky–Kiefer–Wolfowitz inequality, see Talagrand [22]. As for the index function class of the empirical distribution function, the boundedness assumption (H.1) on the components of \(z(t)\) guarantees the conditions for the concentration of suprema of empirical processes.

Next, an upper bound is obtained for the centralized empirical process \(( \mathbb{P}_{n}-\mathbb{P}) [{l}(\beta ^{*}; T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta )]\).

Proposition 4.1

Assume that (H.1)–(H.3) hold. On the event \(\mathcal{A}={{\mathcal{A}}_{1}}\cap {{\mathcal{A}}_{2}}\), we have \(P(\mathcal{A})\ge 1-2d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\). Moreover, the upper bound (4.6) holds with probability at least \(1-2d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\),

$$\begin{aligned} {{( \mathbb{P}_{n}-\mathbb{P}) \bigl({l}\bigl(\beta ^{*};T,z,\Delta \bigr)-l( \hat{\beta }_{n};T,z,\Delta ) \bigr)}} \le \lambda _{a}{\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\hat{\beta }_{n}^{g}}- \beta ^{*g} \bigr\Vert _{2}}}, \end{aligned}$$
(4.6)

where \(\lambda _{a}:=\lambda _{a1}+\lambda _{a2}\) with

$$\begin{aligned} \begin{aligned}&\lambda _{a1} = \frac{{L\sqrt{2{d_{\max }}} }}{{w_{\min }}}\sqrt{ \frac{{\log (2{G_{n}})}}{n}} \quad \textit{and}\quad \\ &{\lambda _{a2}} = \frac{{2L\sqrt{2{d_{\max }}} }}{{{w_{\min }}}} \biggl( {\sqrt{\frac{{\log 2p}}{n}} + A{e^{2LB}}\sqrt{ \frac{{\log (2{G_{n}})}}{n}} } \biggr). \end{aligned} \end{aligned}$$
(4.7)

This proposition states that the difference between the centralized empirical processes is bounded from above by the tuning parameter multiplied by the weighted group Lasso norm of the difference between the estimated parameter and the true parameter \(\beta ^{*}\).
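
For concreteness, the tuning parameter \(\lambda _{a}\) in (4.7) can be evaluated directly once the constants from (H.1)–(H.3) and the group structure are fixed. In the sketch below, the numerical values in the usage comment are arbitrary placeholders rather than recommended choices.

```python
import numpy as np

def lambda_a(n, p, G_n, d_max, L, B, A, w_min):
    """Evaluate lambda_a = lambda_a1 + lambda_a2 from (4.7)."""
    lam_a1 = L * np.sqrt(2 * d_max) / w_min * np.sqrt(np.log(2 * G_n) / n)
    lam_a2 = (2 * L * np.sqrt(2 * d_max) / w_min
              * (np.sqrt(np.log(2 * p) / n)
                 + A * np.exp(2 * L * B) * np.sqrt(np.log(2 * G_n) / n)))
    return lam_a1 + lam_a2

# e.g. lambda_a(n=200, p=1000, G_n=250, d_max=4, L=1.0, B=2.0, A=2.0, w_min=1.0)
```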

For the normalized error \({D_{n}}(\beta ,{\beta ^{*}})\), set

$$ \mathcal{B}= \biggl\{ \sup_{\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} \frac{ \lvert {D_{n} }(\beta ,{\beta ^{*}}) \rvert }{\sum_{g = 1}^{G_{n}} {{w_{g}} \Vert {\beta ^{g}}-\beta ^{*g} \Vert _{2}}} \le { \lambda _{b}}\biggr\} , $$

where \({D_{n}}(\beta ,{\beta ^{*}}):=\frac{1}{n} \sum_{i = 1}^{n} { \biggl[ {\log \frac{{{R_{n}}({T_{i}},{\beta ^{*}})}}{{R({T_{i}},{\beta ^{*}})}}} - {\log \frac{{{R_{n}}({T_{i}},{{\beta }})}}{{R({T_{i}},{{\beta }})}}} \biggr]} {\Delta _{i}}\) and \({\lambda _{b}}\) is a suitable tuning parameter.

Observe that

$$\begin{aligned} {D_{n} }\bigl(\beta ,{\beta ^{*}}\bigr)&: = \Biggl\vert \frac{1}{n} \sum_{i = 1}^{n} \biggl[ \log \frac{{{R_{n}}({T_{i}},{\beta ^{*}})}}{{R({T_{i}},{\beta ^{*}})}} - \log \frac{{{R_{n}}({T_{i}},{{\beta }})}}{{R({T_{i}},{{\beta }})}} \biggr]{\Delta _{i}} \Biggr\vert \\ &= \Biggl\vert \frac{1}{n}\sum_{i = 1}^{n} \Biggl[ \log \frac{1}{n}\sum_{j = 1}^{n} \frac{{{1} ( {{T_{j}} \ge {T_{i}}} ){\mathrm{e} ^{z_{j}^{\tau }({T_{i}}){\beta }}}}}{{R({T_{i}},{\beta })}} - \log \frac{1}{n}\sum_{j = 1}^{n} \frac{{{1} ( {{T_{j}} \ge {T_{i}}} ){\mathrm{e} ^{z_{j}^{\tau }({T_{i}}){\beta ^{*}}}}}}{{R({T_{i}},{\beta ^{*}})}} \Biggr]{\Delta _{i}} \Biggr\vert \\ & \le \sup_{0 \le t \le \tau } \Biggl\vert \log \frac{1}{n}\sum _{j = 1}^{n} \frac{{{1} ( {{T_{j}} \ge t} ){\mathrm{e} ^{z_{j}^{\tau }(t){\beta }}}}}{{R(t,{\beta })}} - \log \frac{1}{n}\sum_{j = 1}^{n} \frac{{{1} ( {{T_{j}} \ge t} ){\mathrm{e} ^{z_{j}^{\tau }(t){\beta ^{*}}}}}}{{R(t,{\beta ^{*}})}} \Biggr\vert \\ & = : \Biggl\vert \log \frac{1}{n}\sum_{i = 1}^{n} \frac{{\mathrm{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\beta }}}}}{{R({t_{s}},{\beta })}} - \log \frac{1}{n}\sum_{i = 1}^{n} \frac{{\mathrm{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\beta ^{*}}}}}}{{R({t_{s}},{\beta ^{*}})}} \Biggr\vert \end{aligned}$$
(4.8)

for certain random variable \({t_{s}}\) on a compact set \([0,\tau ]\).

By the first order Taylor’s expansion of the function \(g_{t_{s}}(\beta ): = \log ( \frac{1}{n}\sum_{i = 1}^{n} {\frac{{{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\beta }}}}{{R({t_{s}},\beta )}}} )\), let the corresponding mean value \(\tilde{\beta }= {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) be between \(\beta _{j}^{*}\) and \(\beta _{j}\) for each \(j=1,2,\ldots ,p\). We have

$$\begin{aligned} &{D_{n} }\bigl(\beta ,{\beta ^{*}}\bigr) \\ &\quad = \Biggl\vert { \sum_{j = 1}^{p} {\bigl( \beta _{j}^{*} - {\beta _{j}}\bigr)} {{ \Biggl[ {\sum _{i = 1}^{n} {\frac{{{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}}}{{R({t_{s}},\tilde{\beta })}}} } \Biggr]}^{ - 1}} } \\ &\qquad {}\times \sum_{i = 1}^{n} \biggl\{ \frac{{{1} ( {{T_{i}} \ge {t_{s}}} )z_{ij}({t_{s}}){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\tilde{\beta }}}}R({t_{s}},\tilde{\beta })}}{{{R^{2}}({t_{s}},{\tilde{\beta }})}} \\ &\qquad {}- \frac{{{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\tilde{\beta }}}}\mathrm{E}[1(T \ge {t_{s}})z_{ij}({t_{s}}){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}{{{R^{2}}({t_{s}},\tilde{\beta })}} \biggr\} \Biggr\vert \\ &\quad = \Biggl\vert {\sum_{j = 1}^{p} {\bigl( \beta _{j}^{*} - {\beta _{j}}\bigr)} \biggl\{ { \frac{{\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} )z_{ij}({t_{s}}){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\tilde{\beta }}}}} }}{{\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} }} - \frac{{\mathrm{E}[1(T \ge {t_{s}})z_{ij}(T){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}{{\mathrm{E}[1(T \ge {t_{s}}){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}} \biggr\} } \Biggr\vert \\ &\quad = \Biggl\vert {\sum_{g = 1}^{G} {{w_{g}} {{\bigl(\beta _{g}^{*} - {\beta _{g}}\bigr)}^{T}}} \biggl\{ {\frac{{\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} )\frac{{z_{ig}({t_{s}})}}{{{w_{g}}}}{\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\tilde{\beta }}}}} }}{{\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} }} - \frac{{\mathrm{E}[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}{{\mathrm{E}[1(T \ge {t_{s}}){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}} \biggr\} } \Biggr\vert \\ &\quad \le \sum_{g = 1}^{G} {{w_{g}} \bigl\Vert \beta _{g}^{*} - {\beta _{g}} \bigr\Vert _{2}} \biggl\Vert \frac{{\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} )\frac{{z_{ig}({t_{s}})}}{{{w_{g}}}}{\mathrm{e} ^{z_{i}^{\tau }({t_{s}}){\tilde{\beta }}}}} }}{{\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} }} \\ &\qquad {} - \frac{{\mathrm{E}[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}}{{\mathrm{E}[1(T \ge {t_{s}}){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]}} \biggr\Vert _{2}. \end{aligned}$$
(4.9)

Using the following decomposition and inequality

$$ \biggl\Vert {\frac{{{a_{n}}}}{{{b_{n}}}} - \frac{a}{b}} \biggr\Vert _{2} = \biggl\Vert {\frac{1}{{{b_{n}}}} \biggl[ {({a_{n}} - a) + \frac{a}{b}(b - {b_{n}})} \biggr]} \biggr\Vert _{2} \le \frac{1}{{ \vert {{b_{n}}} \vert }} \biggl( { \Vert {{a_{n}} - a} \Vert _{2} + \frac{{ \Vert a \Vert _{2} }}{{ \vert b \vert }} \vert {{b_{n}} - b} \vert } \biggr), $$

we obtain

$$\begin{aligned} &{D_{n} }\bigl(\beta ,{\beta ^{*}}\bigr) \\ &\quad = { \Biggl\vert {\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){{{\mathrm{e}} } ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} } \Biggr\vert ^{ - 1}}\Biggl\{ \sum_{g = 1}^{G} {{w_{g}} \bigl\Vert \beta _{g}^{*} - {\beta _{g}} \bigr\Vert _{2}} \\ &\qquad {}\times \Biggl[ \Biggl\Vert { \frac{1}{n}\sum_{i = 1}^{n} { \frac{{{1} ( {{T_{i}} \ge {t_{s}}} )z_{ig}({t_{s}})}}{{{w_{g}}}}{{{\mathrm{e}} } ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} - \mathrm{E}\biggl[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{{{\mathrm{e}} } ^{z^{\tau }({t_{s}})\tilde{\beta }}}\biggr]} \Biggr\Vert _{2} \\ &\qquad{}+ \frac{{ \Vert {\mathrm{E}[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]} \Vert _{2} }}{{ \vert {\mathrm{E}[1(T \ge {t_{s}}){\mathrm{e} ^{z^{\tau }({t_{s}})\tilde{\beta }}}]} \vert }} \\ &\quad {}\times\Biggl\vert {\frac{1}{n} \sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}} - \mathrm{E}\bigl[1(T \ge {t_{s}}){{{\mathrm{e}} } ^{z^{\tau }({t_{s}})\tilde{\beta }}}\bigr]} } \Biggr\vert \Biggr] \Biggr\} \\ &\quad \le { \Biggl\vert {\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}) \tilde{\beta }}}} } \Biggr\vert ^{ - 1}}\sum_{g = 1}^{G} {{w_{g}} \bigl\Vert \beta _{g}^{*} - {\beta _{g}} \bigr\Vert _{2}} \\ &\qquad {}\times\Biggl\{ \Biggl\Vert \frac{1}{n}\sum_{i = 1}^{n} { \frac{{{1} ( {{T_{i}} \ge {t_{s}}} )z_{ig}({t_{s}})}}{{{w_{g}}}}{{{\mathrm{e}} } ^{z_{i}^{\tau }({t_{s}})\tilde{\beta }}}} - \mathrm{E}\biggl[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{{{\mathrm{e}} } ^{z^{\tau }({t_{s}})\tilde{\beta }}}\biggr] \Biggr\Vert _{2} \\ &\qquad{}+ \frac{{L\sqrt{{d_{g}}} }}{{{w_{\min }}}} { { \Biggl\vert {\frac{1}{n} \sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}) \tilde{\beta }}} - \mathrm{E}\bigl[1(T \ge {t_{s}}){{{\mathrm{e}} } ^{z^{\tau }({t_{s}})\tilde{\beta }}}\bigr]} } \Biggr\vert } } \Biggr\} , \end{aligned}$$
(4.10)

where the last inequality is from

$$ \frac{{ \Vert {{\mathrm{{E}}}[1(T \ge {t_{s}})\frac{{{z_{ig}}({t_{s}})}}{{{w_{g}}}}{{\mathrm{{e}}}^{{z^{\tau }}({t_{s}})\tilde{\beta }}}]} \Vert _{2} }}{{ \vert {{\mathrm{{E}}}[1(T \ge {t_{s}}){{\mathrm{{e}}}^{{z^{\tau }}({t_{s}})\tilde{\beta }}}]} \vert }} = \frac{1}{{{w_{g}}}}\sqrt{\sum _{j = 1}^{{d_{g}}} { \biggl( {\mathrm{{E}}}\biggl[ \frac{{1(T \ge {t_{s}}){z_{ij}}({t_{s}}){{\mathrm{{e}}}^{{z^{\tau }}({t_{s}})\tilde{\beta }}}}}{{{\mathrm{{E}}}[1(T \ge {t_{s}}){{\mathrm{{e}}}^{{z^{\tau }}({t_{s}})\tilde{\beta }}}]}}\biggr] \biggr)^{2}} } \le \frac{{L\sqrt{{d_{g}}} }}{{{w_{\min }}}} $$

by using assumptions (H.1)–(H.2).

If \(\hat{\beta }\in {{\mathcal{S}}_{M}}(\beta ^{*})\) for some finite M, then \(\tilde{\beta }\in {{\mathcal{S}}_{M}}(\beta ^{*})\) as well, since

$$ {\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert \tilde{\beta }^{g} - {\beta ^{*g}} \bigr\Vert _{2}}} \le \sum_{g = 1}^{{G_{n}}} {{w_{g}}} \sqrt {\sum_{j = 1}^{{d_{g}}} {t_{j}^{2} \bigl\vert {{\hat{\beta }}_{j}} - \beta _{j}^{*} \bigr\vert ^{2}} } \le {\sum _{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\hat{\beta }_{n}^{g}} - {\beta ^{*g}} \bigr\Vert _{2}}} \le {M}, $$

where \(t_{j}\in [0,1]\) is determined by \(\tilde{\beta }_{j}-\beta _{j}^{*}=t_{j}({\hat{\beta }}_{j}-\beta _{j}^{*})\), since each \(\tilde{\beta }_{j}\) lies between \(\beta _{j}^{*}\) and \(\hat{\beta }_{j}\).

Note that the summation in (4.10) contains a common random variable \({t_{s}}\), which renders (4.10) a dependent summation. In order to bound the quotient and the two centralized summations, we define three events \({{\mathcal{B}}_{0}}\), \({{\mathcal{B}}_{1}}\), \({{\mathcal{B}}_{2}}\), respectively:

$$\begin{aligned}& {{\mathcal{B}}_{0}} = \Biggl\{ \inf_{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }}{ \frac{1}{n}\sum_{j = 1}^{n} {{1} ( {{T_{j}} \ge {t_{s}}} ){\mathrm{e} ^{z_{j}^{\tau }({t_{s}})\beta }}} \ge U} \Biggr\} , \\& {{\mathcal{B}}_{1}} = \bigcap_{g = 1}^{{G_{n}}} \Biggl\{ {{ \Biggl\Vert \frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ig}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} - { \mathrm{{E}}}\biggl[1(T \ge {t_{s}}) \frac{{{z_{ig}}({t_{s}})}{{\mathrm{{e}}}^{{z^{\tau }}({t_{s}}) \beta }}}{{{w_{g}}}}\biggr] \Biggr\Vert }_{2}} \le {\lambda _{b1}}U \Biggr\} , \end{aligned}$$

and

$$\begin{aligned} {{\mathcal{B}}_{2}} &= \Biggl\{ \sup _{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert {\frac{1}{n}\sum _{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}) \beta }} - \mathrm{E}\bigl[1(T \ge {t_{s}}){{{\mathrm{e}} } ^{z^{\tau }({t_{s}})\beta }}\bigr]} } \Biggr\vert \le {\lambda _{b2}}U \Biggr\} . \end{aligned}$$
(4.11)

To bound these quantities, we need concentration inequalities for the suprema of the empirical processes appearing in \(\{ {\mathcal{B}}_{l}\} _{l = 0}^{2}\), uniformly in \(t\in [0,\tau ]\) and \(\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})\); see Sect. 2.14 of van der Vaart and Wellner [25].

Let \({\mathcal{B}} = {{\mathcal{B}}_{0}} \cap {{\mathcal{B}}_{1}} \cap {{\mathcal{B}}_{2}}\). We aim to show that each event in \(\{{\mathcal{B}}_{l}\}_{l=0}^{2}\) holds with high probability; then \({\mathcal{B}}\) is also a high-probability event, since the union bound gives \(P({\mathcal{B}})=1-P(\bigcup_{l=0}^{2}{\mathcal{B}}_{l}^{c})\ge 1-\sum_{l=0}^{2}(1-P({\mathcal{B}}_{l}))= P({{\mathcal{B}}_{0}}) + P({{\mathcal{B}}_{1}}) + P({{\mathcal{B}}_{2}}) - 2\).

Based on (4.10), we obtain the following local stochastic Lipschitz condition under the event \({\mathcal{B}}\):

$$ \frac{ \vert {D_{n} }(\hat{\beta },{\beta ^{*}}) \vert }{\sum_{g = 1}^{G_{n}} {{w_{g}} \Vert {\hat{\beta }_{n}^{g}}-\beta ^{*g} \Vert _{2}}} \le \sup_{\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} \frac{ \vert {D_{n} }(\beta ,{\beta ^{*}}) \vert }{\sum_{g = 1}^{G_{n}} {{w_{g}} \Vert {\beta ^{g}}-\beta ^{*g} \Vert _{2}}} \le { \lambda _{b}}, $$

where \({\lambda _{b}}\) can be viewed as the local stochastic Lipschitz constant.

The following proposition is analogous to Corollary 2 in Kong and Nan [16], but it significantly extends that result from the Lasso to the group Lasso case and from the fixed design to the random design.

Proposition 4.2

Let \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\), and let \({D^{2}}(\sqrt {2})\) be a universal constant. Under (H.1)–(H.3) and for some constant \(A^{2}>2\), we have \(P (\mathcal{B} ) \ge 1-2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}} {G_{n}^{2 - {A^{2}}}}-\frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}\) with

$$\begin{aligned} {\lambda _{b1}} = \frac{{2\sqrt{2} LA{{\mathrm{{e}}}^{2LB}}\sqrt{{d_{\max }}} }}{{{p_{\tau }}{w_{\min }}}} \sqrt{ \frac{{\log ({G_{n}})}}{n}} \quad \textit{and}\quad {\lambda _{b2}} = \frac{{\sqrt{2} A{{\mathrm{{e}}}^{2LB}}}}{{{p_{\tau }}}}\sqrt{\frac{{\log p}}{n}}. \end{aligned}$$
(4.12)

Moreover, letting \(\lambda _{b}:=\lambda _{b1}+\lambda _{b2}\), we have

$$ {D_{n} }\bigl(\hat{\beta },{\beta ^{*}}\bigr)\le {\lambda _{b}} {\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\hat{\beta }_{n}^{g}}- \beta ^{*g} \bigr\Vert _{2}}} $$

with probability at least \(1-2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}}{G_{n}^{2 - {A^{2}}}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}\).

If the true model is sparse and \(\log p =o(n)\), then the two propositions above illustrate that \({P}( \mathcal{A}),{P}( \mathcal{B}) \to 1\) as \(p,n \to \infty \).

4.3 Sharp oracle inequalities from restricted eigenvalue conditions

In this section, we give sharp bounds for estimation and prediction errors for Cox models using a weaker condition similar to the restricted eigenvalue condition of Bickel et al. [4].

Consider linear models \(\{{\mathrm{{E}}} [Y_{i}|X_{i}]=X_{i}^{\tau }{\beta ^{*}}\}_{i=1}^{n}\) with random covariate vectors \(\{{X}_{i}\}_{i=1}^{n}\). The key condition for deriving oracle inequalities rests on the correlation between the covariates, i.e., on the behavior of the sample covariance matrix \({ \Sigma }_{n}=\frac{1}{n} \sum_{i = 1}^{n} {{{ {X}}_{i}}{{X}_{i}^{T}}}\), which is necessarily singular when \(p>n\). Let S be any subset of \(\{1,2,\ldots ,p\}\). The restricted eigenvalue condition (RE in short) of the \(p \times p\) matrix \({ \Sigma }_{n}\) is defined by

$$ RE(\eta ,S,{\Sigma }_{n}) = \inf_{0 \ne {b} \in {\mathrm{{C}}}(\eta ,S)} \frac{{{{({{b}^{T}}{ \Sigma }_{n} {b})}^{1/2}}}}{{{ \Vert {b} \Vert _{2}}}} > 0, $$
(4.13)

where \({\mathrm{{C}}}(\eta ,S)=\{ {b} \in {\mathbb{R}^{p}}:{\| {{{b}_{{S^{c}}}}} \|_{1}} \le \eta {\| {{{b}_{S}}} \|_{1}}\}\), \(\eta >0\).

It should be noted that if we drop the restriction to the sparse set \({\mathrm{{C}}}(\eta ,S)\), (4.13) requires \(\frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b} \Vert _{2}^{2}}\ge RE^{2}( \eta ,S,{\Sigma }_{n})>0\) for all \(b\ne 0\), i.e., the smallest eigenvalue of the sample covariance matrix \({ \Sigma }_{n}\) is positive, which is impossible when \(p>n\) (\({ \Sigma }_{n}\) is rank deficient). To cope with this rank deficiency of \({ \Sigma }_{n}\), Bickel et al. [4] impose the eigenvalue condition only on the sparse restricted set \({\mathrm{{C}}}(\eta ,S)\), which is a reasonable relaxation for sparse high-dimensional estimation. The restricted eigenvalue condition stems from restricted strong convexity, which enforces a form of strong convexity of the negative log-likelihood of linear models on a sparse restricted set.
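
To make the role of the restricted set concrete, the following small numerical sketch is purely illustrative and not part of the paper's argument; the dimensions, the Gaussian design, and the Monte Carlo search over the cone \({\mathrm{{C}}}(\eta ,S)\) are our own assumptions.

```python
# Illustrative sketch (not from the paper): with p > n the sample covariance
# Sigma_n is singular, yet b^T Sigma_n b / ||b||_2^2 can stay bounded away from
# zero for vectors b in the cone C(eta, S) of (4.13).
import numpy as np

rng = np.random.default_rng(0)
n, p, s, eta = 50, 200, 5, 1.0                  # p > n, |S| = s
X = rng.standard_normal((n, p))
Sigma_n = X.T @ X / n                           # rank(Sigma_n) <= n < p
S = np.arange(s)

# numerically ~0 (up to floating point), since Sigma_n is rank deficient
print("smallest eigenvalue of Sigma_n:", np.linalg.eigvalsh(Sigma_n)[0])

def cone_vector():
    """Draw b with ||b_{S^c}||_1 <= eta * ||b_S||_1, i.e., b in C(eta, S)."""
    b = np.zeros(p)
    b[S] = rng.standard_normal(s)
    tail = rng.standard_normal(p - s)
    tail *= eta * np.abs(b[S]).sum() * rng.uniform() / np.abs(tail).sum()
    b[s:] = tail
    return b

# crude Monte Carlo approximation of the infimum in (4.13)
ratios = []
for _ in range(5000):
    b = cone_vector()
    ratios.append(b @ Sigma_n @ b / (b @ b))
print("smallest sampled value of b^T Sigma_n b / ||b||_2^2 on the cone:", min(ratios))
```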

A shortcoming of (4.13) is that we cannot guarantee that \(RE(\eta ,S,{\Sigma }_{n})>0\) holds with high probability. Instead, we replace \({\Sigma }_{n}\) by its non-random version \({ \Sigma }={\mathrm{{E}}}{\Sigma }_{n}\). Observe that \(\frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b}_{S} \Vert _{2}^{2}} \ge \frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b} \Vert _{2}^{2}}>0\) if (4.13) holds, so \({{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}\ge k { \Vert {b}_{S} \Vert _{2}^{2}}>k { \Vert {b}_{S} \Vert _{2}^{2}}-\varepsilon \) for a constant \(k>0\) and a slack constant \(\varepsilon >0\). Technically, for the group penalty, we use here a condition which is a modified version of the restricted eigenvalue condition presented in Blazere et al. [5] for generalized linear models. Denote by \(H^{*}= \{ g: \beta ^{* g} \neq 0 \} \) the index set of the nonzero groups and let \(\gamma ^{*}:= \vert H^{*} \vert \).

Definition

(Group stabil condition)

Let \(c_{0},\varepsilon >0\) be given constants. A \(p \times p\) non-random matrix Σ is said to satisfy the group stabil condition \(GS(c_{0},\varepsilon ,k,H^{*})\) if there exists a constant \(k>0\) such that

$$\begin{aligned} \delta ^{T} \Sigma \delta \geqslant k\sum _{g \in H^{*} } \bigl\Vert \delta ^{g} \bigr\Vert _{2}^{2}-\varepsilon , \quad \forall \delta \in S \bigl(c_{0},H^{*}\bigr), \end{aligned}$$
(4.14)

where the restricted set is defined as \(S(c_{0},H^{*}):=\{ \delta : \sum_{g \in {H^{*}}^{c} }{w_{g}}\Vert \delta ^{g}\Vert _{2}\le c_{0}\sum_{g \in H^{*} }{w_{g}}\Vert \delta ^{g}\Vert _{2} \} \).

\(S(c_{0},H^{*})\) is a restricted cone set with group sparsity; it is similar to the set used by Lounici et al. [18] to prove oracle inequalities for the group Lasso in linear models. The ε is an error (slack) term that may be set to zero, and k can be viewed as the smallest generalized eigenvalue of Σ over the restricted cone.
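
A complementary sketch below (again only an illustration under our own assumptions: an equicorrelation covariance, equal group weights, and Monte Carlo sampling from \(S(c_{0},H^{*})\)) checks the group stabil inequality (4.14) empirically for a given Σ and group partition.

```python
# Illustrative check (not from the paper) of the group stabil condition (4.14);
# the groups, weights, c0, H* and the sampling scheme are assumptions for the demo.
import numpy as np

rng = np.random.default_rng(1)
p, d, rho, c0 = 20, 4, 0.3, 1.0
groups = [np.arange(g * d, (g + 1) * d) for g in range(p // d)]   # 5 groups of size 4
w = np.ones(len(groups))                                          # group weights w_g
H_star = [0, 1]                                                   # indices of nonzero groups
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))             # positive definite

def group_norms(delta):
    return np.array([np.linalg.norm(delta[idx]) for idx in groups])

def in_restricted_set(delta):
    """Membership in S(c0, H*): off-support weighted group norm <= c0 * on-support one."""
    gn = w * group_norms(delta)
    off = sum(gn[g] for g in range(len(groups)) if g not in H_star)
    on = sum(gn[g] for g in H_star)
    return off <= c0 * on

# smallest empirical ratio  delta^T Sigma delta / sum_{g in H*} ||delta^g||_2^2
ratios = []
while len(ratios) < 2000:
    delta = rng.standard_normal(p)
    delta[2 * d:] *= 0.1          # shrink off-support groups so membership is likely
    if not in_restricted_set(delta):
        continue
    denom = sum(group_norms(delta)[g] ** 2 for g in H_star)
    ratios.append(delta @ Sigma @ delta / denom)

# any k below this value works in (4.14) with eps = 0; here it is >= lambda_min(Sigma) = 1 - rho
print("empirical k in (4.14):", min(ratios))
```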

If we assume that the group stabil condition is satisfied for the covariance matrix \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\) on the restricted cone set \(S(c_{0},H^{*})\) with \(\delta =\hat{\beta }_{n}-\beta ^{*}\), then we verify below that \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) holds with high probability. With the preparation above, we are now able to present the main result of this paper, which provides sharp, minimax optimal bounds for the estimation and prediction errors when the true model is sparse and \(\log p\) is small compared with n.

Theorem 4.1

Let \(\gamma ^{*}:=\sum_{g\in H^{*}}d_{g}\), \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\), and let \({D^{2}}(\sqrt {2})\) be a universal constant. Assume that (H.1)–(H.4) and the group stabil condition \(GS(1,\varepsilon _{n},k,H^{*})\) are satisfied for \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\). If λ is chosen such that

$$ \lambda \ge \lambda _{a1}+\lambda _{a2}+\lambda _{b1}+\lambda _{b2}, \quad \textit{with these quantities given by }\text{(4.7)}\textit{ and }\text{(4.12)}, $$

then, with probability at least (recall \(A^{2}>2\))

$$ \begin{gathered} 1 - 2{d_{{\mathrm{{max}}}}} {(2{G_{n}})^{ - {A^{2}}/2}} - 2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}}{G_{n}^{2 - {A^{2}}}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}, \end{gathered} $$

we have \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) and

$$ \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}\le \frac{8\gamma ^{*}\lambda }{kc_{l}}+ \frac{c_{l}\varepsilon _{n}}{2\lambda }, $$

where \(c_{l}>0\) is a constant given in (H.4).

Moreover, if a new covariate \({z^{*}}(t)\) (the test data) is an independent copy of \({z}(t)\) (the training data) and \(\mathrm{E}^{*}\) denotes expectation with respect to \({z^{*}}(t)\) only, then the squared prediction error given \(\Delta =1\) satisfies

$$ {{\mathrm{{E}}}^{*}} {\bigl[{z^{*\tau }}(t) \bigl({{\hat{\beta }}_{n}} - {\beta ^{*}}\bigr)\bigr]^{2}} \le \frac{{32{\gamma ^{*}}{\lambda ^{2}}}}{{kc_{\mathrm{{l}}}^{2}}} + 2{\varepsilon _{n}} $$

under the event \({{\mathcal{A}}}\cap {{\mathcal{B}}}\).

Consider \({\varepsilon _{n}}=0\). The obtained bounds are analogous to those in Lounici et al. [18], who show the optimal convergence rate of the group Lasso estimator for linear models under the fixed design. Note that if \(\gamma ^{*}=O(1)\), then the bound on the estimation error is of the order \(O ( \sqrt{ \frac{\log p}{n}} )+O ( \sqrt{\frac{\log (G_{n})}{n}} )\), and the weighted group Lasso estimator remains consistent for the \(\ell _{2,1}\)-estimation error and for the squared prediction error under the group stabil condition as long as the number of groups grows no faster than \(e^{o(n)}\). The terms \(\sqrt{\log p}\) and \(\sqrt{\log {G_{n}}}\) are the price to pay for the unknown group sparsity of \({\beta ^{*}}\). If the slack term \({\varepsilon _{n}}\) is of larger order than λ, the second term \(\frac{c_{l}\varepsilon _{n}}{2\lambda }\) dominates the bound on the estimation error \(\sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\).
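
To give a rough sense of the magnitudes, the short sketch below evaluates \(\lambda _{b1}\) and \(\lambda _{b2}\) from (4.12) and the order of the estimation-error bound \(8\gamma ^{*}\lambda /(kc_{l})\) as n grows. All constants (L, B, A, \(p_{\tau }\), \(w_{\min }\), \(d_{\max }\), \(\gamma ^{*}\), k, \(c_{l}\)) and the growth of p and \(G_{n}\) are arbitrary assumptions made only to visualize the \(\sqrt{\log p/n}+\sqrt{\log G_{n}/n}\) decay; they are not quantities computed in the paper.

```python
# Evaluate lambda_b1, lambda_b2 from (4.12) for illustrative constants and show
# how the bound 8*gamma*lambda/(k*c_l) shrinks as n grows (log p = o(n)).
# The lambda_a terms of (4.7) are of the same order and are omitted here.
import numpy as np

L, B, A, p_tau, w_min, d_max = 1.0, 1.0, 2.0, 0.5, 1.0, 4
gamma_star, k, c_l = 3, 0.5, 0.5

def lambdas(n, p, G_n):
    lam_b1 = (2 * np.sqrt(2) * L * A * np.exp(2 * L * B) * np.sqrt(d_max)
              / (p_tau * w_min)) * np.sqrt(np.log(G_n) / n)
    lam_b2 = (np.sqrt(2) * A * np.exp(2 * L * B) / p_tau) * np.sqrt(np.log(p) / n)
    return lam_b1, lam_b2

for n in [200, 2000, 20000, 200000]:
    p, G_n = 50 * n, 10 * n                      # high-dimensional regime
    lam = sum(lambdas(n, p, G_n))                # stand-in for lambda
    print(f"n={n:7d}  lambda ~ {lam:8.3f}  8*gamma*lambda/(k*c_l) ~ {8 * gamma_star * lam / (k * c_l):9.3f}")
```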

From Theorem 4.1, if every group has size \(d_{g}=1\), we can derive analogous results for the unweighted Lasso penalty as follows.

Corollary 4.1

Let \(\gamma ^{*}:=\|\beta ^{*}\|_{0}\), \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\), and let \({D^{2}}(\sqrt {2})\) be the universal constant from the proof. Assume that (H.1)–(H.4) and the condition \(GS(1,\varepsilon _{n},k,H^{*})\) are fulfilled for \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\). If λ is chosen such that

$$ \lambda \ge {\sqrt{2}} L\bigl(3+2A{e^{2LB}}\bigr)\sqrt{ \frac{{\log (2p)}}{n}}+ \frac{\sqrt{2}(2L+1)A{{\mathrm{{e}}}^{2LB}} }{p_{\tau }}\sqrt{\frac{{\log p}}{n}}, $$

then, with probability at least

$$ 1 - 2{(2p)^{ - {A^{2}}/2}} - 2{e^{ - np_{\tau }^{2}/2}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{2 - {A^{2}}}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}\quad \bigl(A^{2}>2\bigr), $$

we have \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) and

$$ \bigl\Vert \hat{\beta }_{n}-{\beta ^{*}} \bigr\Vert _{1} \le \frac{8\gamma ^{*}\lambda }{kc_{l}}+ \frac{c_{l}\varepsilon _{n}}{2\lambda }, \qquad {{ \mathrm{{E}}}^{*}} {\bigl[{z^{* \tau }}(t) \bigl({{\hat{\beta }}_{n}} - {\beta ^{*}}\bigr)\bigr]^{2}} \le \frac{{32{\gamma ^{*}}{\lambda ^{2}}}}{{kc_{\mathrm{{l}}}^{2}}} + 2{\varepsilon _{n}}. $$

Corollary 4.1 presents an upper bound for the \(\ell _{1}\)-estimation error, which is similar to the existing result of Theorem 3.2 in Huang et al. [13] for classical Lasso penalized Cox models. An advantage of Corollary 4.1 is that our restricted eigenvalue type condition is imposed on a non-random matrix, whereas Theorem 3.2 in Huang et al. [13] requires an additional analysis of a stochastic restricted eigenvalue condition to guarantee a high-probability event. Another significant difference is that the oracle inequalities in Huang et al. [13] require the sample size to exceed a given constant, while our oracle inequalities are valid for any finite n on the stated high-probability event.

5 Proofs

5.1 Proofs of Theorem 4.1

The proof is based on the following three steps.

Step 1: Check \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\).

Using Proposition 4.1 and Proposition 4.2 to bound the empirical process on the event \(\mathcal{A}\cap \mathcal{B}\) by (4.4), we have

$$\begin{aligned} &( \mathbb{P}_{n}-\mathbb{P}) \bigl( {\ell }\bigl(\beta ^{*};T,z,\Delta \bigr)-{\ell }(\hat{\beta }_{n};T,z,\Delta ) \bigr) \\ &\quad =( \mathbb{P}_{n}- \mathbb{P}) \bigl( l\bigl(\beta ^{*};T,z,\Delta \bigr)-l(\hat{\beta }_{n};T,z, \Delta ) \bigr) -{D_{n}}\bigl(\hat{\beta },{\beta ^{*}}\bigr) \\ &\quad \le (\lambda _{a}+\lambda _{b})\sum _{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}=\lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}. \end{aligned}$$
(5.1)

From (4.3), (5.1) implies

$$\begin{aligned} &\mathbb{P}\bigl( l(\hat{\beta }_{n};T,z,\Delta )-l \bigl(\beta ^{*};T,z,\Delta \bigr)\bigr)+ \lambda \sum _{g = 1}^{G_{n}} {w_{g}} \Vert \hat{\beta }_{n} \Vert _{2} \\ &\quad \le \lambda \sum _{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\lambda {\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{*}}^{g} \bigr\Vert _{2}}}. \end{aligned}$$
(5.2)

By adding \(\lambda \sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\) to both sides of inequality (5.2), on \(\mathcal{A}\cap \mathcal{B}\), we can obtain that

$$ \begin{aligned}[b] &\lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\mathbb{P}\bigl( l( \hat{\beta }_{n};T,z,\Delta )-l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr) \\ &\quad \le \lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl( \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+ \bigl\Vert {\beta ^{*}}^{g} \bigr\Vert _{2}- \bigl\Vert \hat{ \beta }_{n}^{g} \bigr\Vert _{2}\bigr). \end{aligned} $$
(5.3)

If \(g\notin H^{*}\), then \(\Vert \hat{\beta }_{n}^{g}- {\beta ^{*}}^{g}\Vert _{2} +\Vert {\beta ^{*}}^{g} \Vert _{2} - \Vert \hat{\beta }_{n}^{g}\Vert _{2} =0\); otherwise \(\Vert {\beta ^{*}}^{g}\Vert _{2} - \Vert \hat{\beta }_{n}^{g}\Vert _{2} \le \Vert \hat{\beta }_{n}^{g}- {\beta ^{*}}^{g}\Vert _{2}\). So inequality (5.3) reduces to

$$ \lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\mathbb{P}\bigl( l( \hat{\beta }_{n};T,z,\Delta )-l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr)\le 2\lambda \sum_{g\in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}. $$
(5.4)

By the definition of \(\beta ^{*}\) as the minimizer of the expected loss, we have \(\mathbb{P}( l(\hat{\beta }_{n};T,z,\Delta )-l(\beta ^{*};T,z,\Delta ))\ge 0\) and therefore

$$ \sum_{g\notin H^{*}}{w_{g}} \bigl\Vert \hat{ \beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}\le \sum_{g\in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}, $$

i.e., \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\).

Step 2: Find a lower bound for \(\mathbb{P}( l(\hat{\beta }_{n};T,z,\Delta )-l(\beta ^{*};T,z,\Delta ))\).

The next proposition provides the desired lower bound.

Proposition 5.1

Under (H.4), conditioning on \(\Delta =1\), we have

$$\begin{aligned} \mathbb{P}\bigl( l(\hat{\beta }_{n};T,z,\Delta )-l \bigl(\beta ^{*};T,z,\Delta \bigr)\bigr) \ge \frac{c_{l}}{2} \mathrm{E}^{*}{\bigl[{z_{i}^{*{\tau }}(t)}\bigl( \hat{ \beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}} \end{aligned}$$
(5.5)

where \(c_{l}>0\) is the constant given in (H.4).

Proof

By a second-order Taylor expansion of the function \(\beta \mapsto l(\beta ;T,z,\Delta )\), the corresponding mean value \(\tilde{\beta }= {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) has each component \({{\tilde{\beta }}_{j}}\) lying between \(\beta _{j}^{*}\) and \(\beta _{j}\), \(j=1,2,\ldots ,p\).

Let \(z^{\tau }(t)\tilde{\beta }\) be the intermediate point between \(z^{\tau }(t)\beta ^{*}\) and \(z^{\tau }(t)\hat{\beta }_{n}\) given by the second-order Taylor expansion of \(l(\beta ;T,z,\Delta )\) around \(\beta ^{*}\). Then, conditioning on \(\Delta =1\), we have

$$\begin{aligned} &\mathbb{P}\bigl( l(\hat{\beta }_{n};T,z,\Delta )-l\bigl(\beta ^{*};T,z, \Delta \bigr)\bigr) \\ &\quad ={{\mathrm{{E}}}^{*}}\bigl[{\mathrm{{E}}}\bigl\{ l(\beta ;T,z, \Delta )-l\bigl(\beta ^{*};T,z, \Delta \bigr)|{z_{i}^{{\tau }}(t)} \bigr\} \bigr]\big|_{\beta =\hat{\beta }_{n}} \\ &\quad ={{\mathrm{{E}}}^{*}} {\mathrm{{E}}}\bigl[\bigl\{ l\bigl(\beta ;T,{z^{*}},\Delta \bigr)-l\bigl(\beta ^{*};T,{z^{*}}, \Delta \bigr)|{z_{i}^{*{\tau }}(t)}\bigr\} \bigr]\big|_{\beta =\hat{\beta }_{n}} \\ &\quad ={{\mathrm{{E}}}^{*}} {\mathrm{{E}}}\biggl\{ [\bigl(\beta -\beta ^{*}\bigr)^{\tau }\dot{l}\bigl(\beta ^{*} ,z^{*},\Delta \bigr)+\frac{1}{2}\bigl(\beta -\beta ^{*}\bigr)^{\tau }\ddot{l}\bigl( \tilde{\beta },z^{*}, \Delta \bigr) \bigl(\beta -\beta ^{*}\bigr)\biggr\} \bigg|_{\beta =\hat{\beta }_{n}}, \\ &\qquad \tilde{\beta }\in {{\mathcal{S}}_{M}}\bigl(\beta ^{*}\bigr) \\ &\quad =\bigl\{ \bigl(\beta -\beta ^{*}\bigr)^{\tau }{{ \mathrm{{E}}}^{*}} {\mathrm{{E}}}\bigl[\dot{l}\bigl(\beta ^{*} ,z^{*}, \Delta \bigr)\bigr]\bigr\} \big|_{\beta =\hat{\beta }_{n}}+\frac{1}{2} \bigl(\beta -\beta ^{*}\bigr)^{\tau }{{\mathrm{{E}}}^{*}} {{\mathrm{{E}}}}\bigl\{ \ddot{l}\bigl(\tilde{\beta },z^{*},\Delta \bigr) \bigr\} \bigl( \beta -\beta ^{*}\bigr)\big|_{\beta =\hat{\beta }_{n}} \\ &\quad \bigl[\text{By (H.4)}\bigr] \\ &\quad \ge \frac{c_{l}}{2}{{\mathrm{{E}}}^{*}} {\mathrm{{E}}}\bigl\{ \bigl[{z^{* \tau }}(t) \bigl(\hat{\beta }_{n}-\beta ^{*}\bigr)\bigr]^{2}\bigr\} =\frac{c_{l}}{2} \mathrm{E}^{*}{\bigl[{z_{i}^{*{\tau }}(t)}\bigl( \hat{ \beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}}, \end{aligned}$$
(5.6)

where the first-order term \((\beta -\beta ^{*})^{\tau }{{\mathrm{{E}}}^{*}} {\mathrm{{E}}}[\dot{l}(\beta ^{*} ,z^{*}, \Delta )]\) vanishes because \(\beta ^{*}\) solves the estimating equation (3.4). □

From Proposition 5.1 and (5.4), we deduce that

$$ \lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\frac{c_{l}}{2} \mathrm{E}^{*}{\bigl[{z^{*{\tau }}(t)}\bigl( \hat{\beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}}\le 2\lambda \sum_{g\in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}. $$
(5.7)

Step 3: Squeeze error bounds from group stabil condition

Let Σ be the \(p \times p\) covariance matrix whose entries are \(\mathrm{E}[{z_{j}(t)}{z_{k}(t)}]=\mathrm{E}^{*}[{z_{j}(t)}{z_{k}(t)}]\). We have

$$ \mathrm{E}^{*}{\bigl[{z^{*{\tau }}(t)}\bigl( \hat{\beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}}=\bigl( \hat{ \beta }_{n}-{\beta ^{*}}\bigr)^{\tau } \mathrm{E}^{*}\bigl[{z(t)} {z^{{\tau }}(t)}\bigr]\bigl( \hat{\beta }_{n}-{\beta ^{*}}\bigr)=\bigl(\hat{\beta }_{n}-{ \beta ^{*}}\bigr)^{\tau } \Sigma \bigl(\hat{\beta }_{n}-{\beta ^{*}}\bigr) $$

since \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\) is assumed to satisfy the group stabil condition \(GS(1,\varepsilon _{n},k,H^{*})\), which applies with \(\delta =\hat{\beta }_{n}-\beta ^{*}\) once \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) has been verified in Step 1. Multiplying (4.14) by \(c_{l}/2\), we have

$$ \frac{c_{l}}{2}\bigl(\hat{\beta }_{n}-{\beta ^{*}} \bigr)^{T}\Sigma \bigl(\hat{\beta }_{n}-{\beta ^{*}}\bigr)\ge \frac{kc_{l}}{2}\sum_{g \in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}^{2}- \frac{c_{l}\varepsilon _{n}}{2}. $$

Substituting the above inequality into (5.7) and using the Cauchy–Schwarz inequality, we get

$$ \begin{aligned} &\lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\frac{kc_{l}}{2}\sum_{g \in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}^{2} \\ &\quad \le 2 \lambda \sqrt{\sum_{g \in H^{*}}d_{g}}\sqrt {\sum_{g \in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}^{2}}+ \frac{c_{l}\varepsilon _{n}}{2}. \end{aligned} $$

Now the fact that \(2xy\le tx^{2}+y^{2}/t\) for all \(t>0\) leads to the following inequality:

$$ \begin{aligned}[b] &\lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\frac{kc_{l}}{2}\sum_{g \in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}^{2} \\ &\quad \le 4t \lambda ^{2} \gamma ^{*}+\frac{1}{t}\sum _{g \in H^{*}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}^{2}+\frac{c_{l}\varepsilon _{n}}{2}. \end{aligned} $$
(5.8)

Putting \(t:=\frac{2}{kc_{l}}\) in (5.8), we have the oracle inequality

$$ \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}\le \frac{8\gamma ^{*}\lambda }{kc_{l}}+ \frac{c_{l}\varepsilon _{n}}{2\lambda }. $$
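
For completeness, here is the short algebra behind this substitution (our own expansion, added only for readability): with \(t=2/(kc_{l})\) the term \(\frac{1}{t}\sum_{g \in H^{*}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g}\Vert _{2}^{2}\) equals \(\frac{kc_{l}}{2}\sum_{g \in H^{*}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g}\Vert _{2}^{2}\) and cancels against the left-hand side of (5.8), leaving

$$ \lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}\le \frac{8\gamma ^{*}\lambda ^{2}}{kc_{l}}+ \frac{c_{l}\varepsilon _{n}}{2}, $$

and dividing both sides by λ gives the display above.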

Finally, for the prediction oracle inequality, it is deduced from (5.7) that

$$ \begin{aligned}[b] &\lambda \sum_{g=1}^{G_{n}}{w_{g}} \bigl\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \bigr\Vert _{2}+\frac{c_{l}}{2} \mathrm{E}^{*}{\bigl[{z^{*{\tau }}(t)}\bigl( \hat{\beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}} \\ &\quad \le 2\lambda \Biggl( {\sum_{g = 1}^{{G_{n}}} {w_{g}} {{ \bigl\Vert {\hat{\beta }_{n}^{g} - {\beta ^{*}}^{g}} \bigr\Vert }_{2}} - \sum _{g \notin {H^{*}}} {w_{g}} {{ \bigl\Vert {\hat{\beta }_{n}^{g} - {\beta ^{*}}^{g}} \bigr\Vert }_{2}}}\Biggr). \end{aligned} $$
(5.9)

Therefore,

$$ \frac{c_{l}}{2}\mathrm{E}^{*}{\bigl[{z^{*{\tau }}(t)}\bigl( \hat{\beta }_{n}- {\beta ^{*}} \bigr)\bigr]^{2}} \le 2\lambda \sum_{g = 1}^{{G_{n}}} {w_{g}} {{ \bigl\Vert {\hat{\beta }_{n}^{g} - {\beta ^{*}}^{g}} \bigr\Vert }_{2}}. $$

Note that the term \(\sum_{g \notin {H^{*}}} {w_{g}} {\| {\hat{\beta }_{n}^{g} - {\beta ^{*}}^{g}} \|_{2}} = \sum_{g \notin {H^{*}}} {w_{g}} {\| {\hat{\beta }_{n}^{g}} \|_{2}}\) (recall that \({\beta ^{*}}^{g} = 0\) for \(g\notin H^{*}\)), which was discarded to obtain the inequality above, is typically very small, so little is lost in this step.

Then using the oracle inequality for \(\sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\) just derived leads to

$$\begin{aligned} \frac{c_{l}}{2}\mathrm{E}^{*}{\bigl[{z^{*{\tau }}(t)} \bigl( \hat{\beta }_{n}- {\beta ^{*}} \bigr) \bigr]^{2}} & \le 2\lambda \sum_{g = 1}^{{G_{n}}} {w_{g}} {{ \bigl\Vert {\hat{\beta }_{n}^{g} - { \beta ^{*}}^{g}} \bigr\Vert }_{2}} \le 2\lambda \biggl( {\frac{{8{\gamma ^{*}}\lambda }}{{k{c_{l}}}} + \frac{{{c_{l}}{\varepsilon _{n}}}}{{2\lambda }}} \biggr) = \frac{{16{\gamma ^{*}}{\lambda ^{2}}}}{{k{c_{l}}}} + {c_{l}} {\varepsilon _{n}}. \end{aligned}$$

Dividing both sides by \(c_{l}/2\) yields the prediction bound stated in the theorem. Finally, we conclude the proof by using Propositions 4.1 and 4.2, which show that the desired oracle inequalities hold with the stated high probability on the event \({{\mathcal{A}}}\cap {{\mathcal{B}}}\).

5.2 Proofs of the propositions

5.2.1 Proof of Proposition 4.1

First we control the event \({{\mathcal{A}}_{1}}\) by applying Hoeffding's inequality; see Wainwright [26].

Lemma 5.1

(Hoeffding’s inequality)

Let \({X_{1}}, \ldots ,{X_{n}}\) be independent random variables on \(\mathbb{R}\) satisfying bound condition \({a_{i}}\le {{X_{i}}} \le {b_{i}}\) for \(i = 1,2, \ldots ,n \). Then we have

$$ P\Biggl(\Biggl|\sum_{i = 1}^{n} ({{X_{i}}}-{\mathrm{{E}}} {X_{i}}) \Biggr| \ge t\Biggr) \le 2 \exp \biggl\{ \frac{{ - 2{t^{2}}}}{{\sum_{i = 1}^{n} {(b_{i}-a_{i})^{2}} }} \biggr\} . $$
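
As a quick numerical illustration of Lemma 5.1 (the Bernoulli data and the specific n and t below are our own assumptions; this simulation plays no role in the proof), one can compare an empirical tail probability with the Hoeffding bound:

```python
# Monte Carlo check of Hoeffding's inequality for bounded variables:
# P(|sum_i (X_i - E X_i)| >= t) <= 2 exp(-2 t^2 / sum_i (b_i - a_i)^2).
import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 100, 12.0, 50_000
X = rng.integers(0, 2, size=(reps, n))          # X_i in {0, 1}, so b_i - a_i = 1
dev = np.abs(X.sum(axis=1) - n * 0.5)           # |sum_i (X_i - E X_i)|
print("empirical tail :", np.mean(dev >= t))
print("Hoeffding bound:", 2 * np.exp(-2 * t**2 / n))
```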

For \({{\mathcal{A}}_{1}} = \bigcap_{g = 1}^{{G_{n}}} \{ {{ \Vert {\frac{1}{n}\sum_{i = 1}^{n} {[ \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} - {\mathrm{{E}}}( \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}})]} } \Vert _{2}} \le {\lambda _{a1}}} \} \), let \(W_{i}^{g}:= \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}- \mathrm{E}(\frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}})\) and

$$ W_{ij}^{g}:= \frac{{z_{ij}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}- \mathrm{E}\biggl(\frac{{z_{ij}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}\biggr),\quad j=1, \ldots,d_{g};i=1,\ldots,n. $$

We have

$$ {P}\bigl({\mathcal{A}}_{1}^{c}\bigr)\le \sum_{g=1}^{G_{n}}{P} \Biggl\lbrace \Biggl\Vert \frac{1}{n}\sum_{i=1}^{n}W_{i}^{g} \Biggr\Vert _{2}^{2}> {\lambda _{a1}^{2}} \Biggr\rbrace \le \sum_{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \Biggl\vert \frac{1}{n}\sum_{i=1}^{n} W_{ij}^{g} \Biggr\vert >\frac{{\lambda _{a1}}}{{\sqrt{{d_{g}}} }} \Biggr\rbrace $$
(5.10)

due to \(\lbrace \Vert \frac{1}{n}\sum_{i=1}^{n}W_{i}^{g} \Vert _{2}^{2}> {\lambda _{a1}^{2}} \rbrace \subset \bigcup_{j \in \mathrm{Group}_{g}, \vert \mathrm{Group}_{g} \vert = {d_{g}}} \{ {|\frac{1}{n} {\sum_{i = 1}^{n} {W_{ij}^{g}} } |^{2} > \frac{\lambda _{a1}^{2}}{ {{d_{g}}} }} \} \).

Applying Hoeffding’s inequality with \({a_{i}}=\frac{-L}{n{w_{\min }}}\le \frac{1}{n} \frac{{z_{ij}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} \le \frac{L}{n{w_{\min }}}= {b_{i}}\), we obtain

$$\begin{aligned} P \Biggl\{ { \Biggl\vert {\frac{1}{n}\sum _{i = 1}^{n} {W_{ij}^{g}} } \Biggr\vert > \frac{{{\lambda _{a1}}}}{{\sqrt{{d_{g}}} }}} \Biggr\} \le 2\exp \biggl( { - \frac{{nw_{\min }^{2}\lambda _{a1}^{2}}}{{2{L^{2}}{d_{g}}}}} \biggr) \le 2\exp \biggl( { - \frac{{nw_{\min }^{2}\lambda _{a1}^{2}}}{{2{L^{2}}{d_{\max }}}}} \biggr). \end{aligned}$$
(5.11)

Finally, from (5.10) and (5.11), it is deduced that

$$\begin{aligned} {P}\bigl({\mathcal{A}}_{1}^{c}\bigr)\le 2d_{\mathrm{max}}G_{n}\exp \biggl( { - \frac{{nw_{\min }^{2}\lambda _{a1}^{2}}}{{2{L^{2}}{d_{\max }}}}} \biggr)=: d_{\mathrm{max}}(2G_{n})^{1-A^{2}}, \end{aligned}$$
(5.12)

which gives \(\lambda _{a1} = \frac{{L\sqrt{2{d_{\max }}} }}{{w_{\min }}}\sqrt{\frac{{\log (2{G_{n}})}}{n}}\).

For \({{\mathcal{A}}_{2}}\), we resort to McDiarmid's concentration inequality with the bounded difference condition for random vectors, see Wainwright [26].

Lemma 5.2

Suppose that \(X_{1},\ldots ,X_{n}\) are independent random vectors all taking values in the set A, and assume that \(f:A^{n}\rightarrow \mathbb{R}\) is a function satisfying the bounded difference condition

$$ \sup_{x_{1},\ldots ,x_{n},x_{k}^{\prime }\in A} \bigl\vert f(x_{1}, \ldots ,x_{n})-f\bigl(x_{1},\ldots ,x_{k-1},x_{k}^{\prime },x_{k+1}, \ldots ,x_{n}\bigr) \bigr\vert \le c_{k}. $$

Then, for all \(t>0\),

$$ P \bigl[ { \bigl\vert {f({X_{1}},\ldots ,{X_{n}}) - { \mathrm{{E}}} \bigl\{ f({X_{1}}, \ldots ,{X_{n}}) \bigr\} } \bigr\vert \ge t} \bigr] \le 2\exp \Biggl( -2{t^{2}} \Big/\sum _{i = 1}^{n} {c_{i}^{2}} \Biggr). $$

If the absolute value is removed from the event above (a one-sided deviation), then the upper bound becomes \(\exp ( -2{t^{2}}/\sum_{i = 1}^{n} {c_{i}^{2}} )\).

Similar to the treatment of \({\mathcal{A}}_{1}\), let

$$ Z_{i}^{g}(\beta ): = \frac{{\int {z_{ig}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} - {\mathrm{{E}}} \biggl( \frac{{\int {z_{ig}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \biggr) $$

and

$$ \begin{aligned} Z_{ij}^{g}(\beta )&:= \frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} - {\mathrm{{E}}} \biggl( \frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \biggr),\\ &\quad j=1,\ldots,d_{g};i=1,\ldots,n. \end{aligned} $$

Then \({{\mathcal{A}}_{2}}: = \bigcap_{g = 1}^{{G_{n}}} {\{ {{\| {\frac{1}{n}\sum_{i = 1}^{n} {Z_{i}^{g}} } \|_{2}} \le {\lambda _{a2}}} \}} \). We have

$$\begin{aligned} {P}\bigl({\mathcal{A}}_{2}^{c}\bigr)&\le \sum _{g=1}^{G_{n}}{P} \Biggl\lbrace \Biggl\Vert \frac{1}{n}\sum_{i=1}^{n}Z_{i}^{g}( \beta ) \Biggr\Vert _{2}^{2}> {\lambda _{a2}^{2}} \Biggr\rbrace \le \sum_{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \Biggl\vert \frac{1}{n}\sum_{i=1}^{n} Z_{ij}^{g}( \beta ) \Biggr\vert >\frac{{\lambda _{a2}}}{{\sqrt{{d_{g}}} }} \Biggr\rbrace \\ & \le \sum_{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \sup _{\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} \Biggl\vert \frac{1}{n}\sum _{i=1}^{n} Z_{ij}^{g}(\beta ) \Biggr\vert > \frac{{\lambda _{a2}}}{{\sqrt{{d_{\max }}} }} \Biggr\rbrace . \end{aligned}$$
(5.13)

Let

$$ \begin{aligned} f({z_{1}},\ldots ,{z_{n}})&=\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert \frac{1}{n}\sum_{i = 1}^{n} \biggl\{ \frac{{\int {z_{ij}^{\tau }(T_{i})1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(s)\beta }}\,dF(s,z)} }} \\ &\quad {}- {\mathrm{{E}}} \biggl( {\frac{{\int {z_{ij}^{\tau }(T_{i})1(s \ge {T_{i}}){e^{z_{i}^{\tau }(s)\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \Biggr\vert \end{aligned} $$

and

$$\begin{aligned} & f(z_{1},\ldots ,z_{k-1},\tilde{z}_{k},z_{k+1}, \ldots ,z_{n}) \\ &\quad =\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert \frac{1}{n}\sum _{i = 1,i \ne k}^{n} \biggl\{ \frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \\ &\qquad {} - { \mathrm{{E}}} \biggl( {\frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \\ & \qquad {} + \frac{1}{n}\biggl\{ \frac{{\int {z_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \biggl( {\frac{{\int {z_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \Biggr\vert . \end{aligned}$$
(5.14)

Then we have

$$\begin{aligned} & f(z_{1},\ldots ,z_{n})-f(z_{1}, \ldots ,z_{k-1},\tilde{z}_{k},z_{k+1}, \ldots ,z_{n}) \end{aligned}$$
(5.15)
$$\begin{aligned} & \quad \le \sup_{\beta \in {S_{M}}({\beta ^{*}})} \biggl\vert \frac{1}{n}\biggl\{ \frac{{\int {z_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \biggl( {\frac{{\int {z_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{z_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \\ & \qquad {} - \frac{1}{n}\biggl\{ \frac{{\int {\tilde{z}_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{\tilde{z}_{k}^{\tau }({T_{k}})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{\tilde{z}_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \biggl( {\frac{{\int {\tilde{z}_{kj}^{\tau }(s)1(s \ge {T_{k}}){e^{\tilde{z}_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{k}}){e^{\tilde{z}_{k}^{\tau }(T_{k})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \biggr\vert . \end{aligned}$$
(5.16)

Note that, for \(j=1,\ldots,d_{g}\) and \(i=1,\ldots,n\), we have

$$\begin{aligned} - \frac{{L{e^{2LB}}}}{{{w_{\min }}}} &= - \frac{{\int {L1(s \ge {T_{i}}){e^{LB}}\,dF(s,z)} }}{{{w_{\min }}\int {1(s \ge {T_{i}}){e^{ - LB}}\,dF(s,z)} }}\le \frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(s)\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(s)\beta }}\,dF(s,z)} }} \\ & \le \frac{{\int {L1(s \ge {T_{i}}){e^{LB}}\,dF(s,z)} }}{{{w_{\min }}\int {1(s \ge {T_{i}}){e^{ - LB}}\,dF(s,z)} }} = \frac{{L{e^{2LB}}}}{{{w_{\min }}}}. \end{aligned}$$
(5.17)

For fixed j, (5.15) gives

$$ \bigl\vert f(z_{1},\ldots ,z_{n})-f(z_{1}, \ldots ,z_{k-1},\tilde{z}_{k},z_{k+1}, \ldots ,z_{n}) \bigr\vert \le \frac{{4L{e^{2LB}}}}{{{nw_{\min }}}} $$

for all \({z_{1},\ldots ,z_{n},\tilde{z}_{k}}\).

Lemma 5.2 implies

$$ P \Biggl\{ {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\frac{1}{n}\sum _{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert \ge {\mathrm{{E}}} \Biggl( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\frac{1}{n}\sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert } \Biggr) + t} \Biggr\} \le \exp \biggl( { - \frac{{nt^{2}w_{\min }^{2}}}{{8{L^{2}}{e^{4LB}}}}} \biggr). $$
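
To see where this exponent comes from (a short verification added for readability), note that every bounded-difference constant equals \(c_{k}=\frac{4L{e^{2LB}}}{n{w_{\min }}}\), so

$$ \sum_{i=1}^{n} c_{i}^{2}= n\biggl(\frac{4L{e^{2LB}}}{n{w_{\min }}}\biggr)^{2}= \frac{16L^{2}{e^{4LB}}}{n w_{\min }^{2}}, \qquad \frac{2t^{2}}{\sum_{i=1}^{n}c_{i}^{2}}= \frac{nt^{2}w_{\min }^{2}}{8L^{2}{e^{4LB}}}, $$

which is exactly the exponent in the one-sided bound of Lemma 5.2.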

It now suffices to bound \({\mathrm{{E}}} ( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \vert {\frac{1}{n}\sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \vert } )\) sharply by the symmetrization theorem and the maximal inequality below, which can be found in van der Vaart and Wellner [25] and Wainwright [26].

Lemma 5.3

(Symmetrization theorem)

Let \(\varepsilon _{1},\ldots,\varepsilon _{n}\) be a Rademacher sequence with uniform distribution on \(\{ - 1,1\}\), independent of \(X_{1},\ldots,X_{n}\) and \(f\in \mathcal{F}\). Then we have

$$ {\mathrm{{E}}} \Biggl[ \sup_{f \in \mathcal{F}} \Biggl\lvert \sum _{i=1}^{n} \bigl[ f(X_{i})-{\mathrm{{E}}} \bigl\{ f(X_{i}) \bigr\} \bigr] \Biggr\rvert \Biggr]\le 2{ \mathrm{{E}}} \Biggl[{\mathrm{{E}}}_{\epsilon } \Biggl\{ \sup _{f \in \mathcal{F}} \Biggl\lvert \sum_{i=1}^{n} \epsilon _{i}f(X_{i}) \Biggr\rvert \Biggr\} \Biggr], $$

where \({\mathrm{{E}}}[\cdot ]\) refers to the expectation w.r.t. \(X_{1},\ldots,X_{n}\) and \({\mathrm{{E}}}_{\epsilon } \{ \cdot \} \) w.r.t. \(\epsilon _{1},\ldots,\epsilon _{n}\).

Using the symmetrization theorem, we have

$$\begin{aligned} & {\mathrm{{E}}} \Biggl[\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert \frac{1}{n}\sum_{i = 1}^{n} \biggl\{ \frac{{\int {z_{ij}(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }({T_{i}})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \\ &\qquad {}- {\mathrm{{E}}} \biggl( {\frac{{\int {z_{ij}^{\tau }(s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}} \biggr)\biggr\} \Biggr\vert \Biggr] \\ &\quad \le 2{\mathrm{{E}}} \Biggl[{\mathrm{{E}}}_{\epsilon } \Biggl\{ \sup _{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\lvert \sum_{i=1}^{n} \frac{\epsilon _{i}{\int {z_{ij} (s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{n{w_{g}}\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \Biggr\rvert \Biggr\} \Biggr] \\ &\quad \le \frac{2}{n{w_{\min }}}{\mathrm{{E}}} \Biggl(\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\sum_{i = 1}^{n} {{w_{i}}( \beta )}\epsilon _{i}} \Biggr\vert \Biggr), \end{aligned}$$
(5.18)

where \(w_{i}(\beta ):= \frac{{\int {z_{ij} (s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \) for \(i=1,2,\ldots,n\).

For any \(w_{i}(\beta )\), we can find a sequence of random vectors \(\{a_{i}\}_{i=1}^{n}\in \mathbb{R}^{p}\) with \({ \Vert a_{i} \Vert _{\infty }}=1\) and vector \(b \in \mathbb{R}^{p}\) with \({{ \Vert b \Vert }_{1}} \le L\) such that

$$ -L \le w_{i}(\beta ) = {a_{i}^{T}b} \le {{{{ \Vert a_{i} \Vert }_{\infty }} {{ \Vert b \Vert }_{1}}}} = {{ \Vert b \Vert }_{1}} \le L. $$

Then we have

$$\begin{aligned}& \frac{2}{n{w_{\min }}}{\mathrm{{E}}} \Biggl(\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\sum_{i = 1}^{n} {{w_{i}}( \beta )}\epsilon _{i}} \Biggr\vert \Biggr)\\& \quad \le \frac{2}{n{w_{\min }}}{\mathrm{{E}}} \Biggl( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert \sum_{i = 1}^{n} \sum _{j = 1}^{p} \epsilon _{i}{{a_{ij}}} {b_{j}} \Biggr\vert } \Biggr) \\& \quad =\frac{2}{n{w_{\min }}}{\mathrm{{E}}} \Biggl( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert \sum_{j = 1}^{p} \Biggl(\sum _{i = 1}^{n} \epsilon _{i}{{a_{ij}}} \Biggr) {b_{j}} \Biggr\vert } \Biggr) \\& \quad \bigl[\text{By H{\"{o}}lder's inequality}\bigr] \\& \quad \le \frac{2}{n{w_{\min }}} {\mathrm{{E}}} \Biggl(\sup_{{ \Vert b \Vert _{1}}\le L} \max _{1 \le j \le p} \Biggl\lvert \sum_{i=1}^{n} \epsilon _{i}{{a_{ij}}} \Biggr\rvert \cdot { \Vert b \Vert _{1}} \Biggr) \\& \quad \le \frac{{2L}}{n{w_{\min }}}{\mathrm{{E}}} \Biggl( \max_{1 \le j \le p} \Biggl\lvert \sum_{i=1}^{n} \epsilon _{i}{{a_{ij}}} \Biggr\rvert \Biggr)=\frac{2L}{n{w_{\min }}}{ \mathrm{{E}}} \Biggl({\mathrm{{E}}}_{\epsilon } \max_{1 \le j \le p} \Biggl\lvert \sum_{i=1}^{n} \epsilon _{i}{{a_{ij}}} \Biggr\rvert \Biggr). \end{aligned}$$

Next, we are going to use the following maximal inequality for bounded variables; see [31] for more discussions.

Lemma 5.4

(Maximal inequality)

Let \(X_{1},\ldots,X_{n}\) be independent random vectors that take values in a measurable space \(\mathcal{X}\), and let \(f_{1},\ldots,f_{p}\) be real-valued functions on \(\mathcal{X}\) which satisfy, for all \(j=1,\ldots,p\) and all \(i=1,\ldots,n\),

$$ {\mathrm{{E}}}f_{j}(X_{i})=0, \qquad \bigl\vert f_{j}(X_{i}) \bigr\vert \le a_{ij}. $$

Then

$$ {\mathrm{{E}}} \Biggl( \max_{1\le j\le p} \Biggl\lvert \sum _{i=1}^{n}f_{j}(X_{i}) \Biggr\rvert \Biggr) \le \sqrt{2\log (2p)} \max_{1\le j\le p}\sqrt{ \sum_{i=1}^{n}a_{ij}^{2}}. $$
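
Before applying the lemma, a small Monte Carlo comparison may help calibrate the \(\sqrt{2\log (2p)}\) factor; the choices \(a_{ij}\in \{-1,+1\}\) and \(f_{j}(X_{i})=\epsilon _{i}a_{ij}\) with Rademacher \(\epsilon _{i}\) are assumptions made only for this demonstration.

```python
# Monte Carlo check of the maximal inequality with f_j(X_i) = eps_i * a_ij,
# |a_ij| <= 1: E[max_j |sum_i f_j(X_i)|] <= sqrt(2 log(2p)) * max_j sqrt(sum_i a_ij^2).
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 400, 1000, 500
A = rng.choice([-1.0, 1.0], size=(n, p))         # fixed a_ij, so max_j sqrt(sum_i a_ij^2) = sqrt(n)
eps = rng.choice([-1.0, 1.0], size=(reps, n))    # Rademacher signs
emp = np.mean(np.max(np.abs(eps @ A), axis=1))   # Monte Carlo estimate of the expectation
print("empirical E[max_j |sum_i eps_i a_ij|]:", emp)
print("bound sqrt(2 n log(2p))             :", np.sqrt(2 * n * np.log(2 * p)))
```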

By Lemma 5.4, with \({\mathrm{{E}}}[\epsilon _{i}{{a_{ij}}}]=0\) and \(\vert \epsilon _{i}{{a_{ij}}}\vert \le \max_{1 \le i \le n}{ \Vert a_{i} \Vert _{\infty }} =1\), we get

$$\begin{aligned} {\mathrm{{E}}} \Biggl( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert { \frac{1}{n}\sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert } \Biggr) &\le \frac{2L}{n{w_{\min }}}{\mathrm{{E}}} \Biggl( \max_{1 \le j \le p} \Biggl\lvert \sum_{i=1}^{n} {{\epsilon _{i}}} {{a_{ij}}} \Biggr\rvert \Biggr)\\ &\le \frac{2L}{n{w_{\min }}}\sqrt{2\log 2p}\sqrt{n}= \frac{{2L}}{{{w_{\min }}}}\sqrt{ \frac{{2\log 2p}}{n}}. \end{aligned}$$

Then

$$\begin{aligned} & P \Biggl\{ {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\frac{1}{n}\sum _{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert \ge \frac{{2L}}{{{w_{\min }}}}\sqrt{\frac{{2\log 2p}}{n}} + t} \Biggr\} \\ &\quad \le P \Biggl\{ {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\frac{1}{n} \sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert \ge {\mathrm{{E}}} \Biggl( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \Biggl\vert {\frac{1}{n}\sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \Biggr\vert } \Biggr) + t} \Biggr\} \le \exp \biggl( { - \frac{{nt^{2}w_{\min }^{2}}}{{8{L^{2}}{e^{4LB}}}}} \biggr). \end{aligned}$$

Therefore, (5.13) can be further bounded by letting \(\frac{{\lambda _{a2}}}{{\sqrt{{d_{\max }}} }}= \frac{{2L}}{{{w_{\min }}}}\sqrt{\frac{{2\log 2p}}{n}} + t\)

$$\begin{aligned} {P}\bigl({\mathcal{A}}_{2}^{c}\bigr)&\le \sum _{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \sup _{\beta \in {{\mathcal{S}}_{M}}( \beta ^{*})} \Biggl\vert \frac{1}{n}\sum _{i=1}^{n} Z_{ij}^{g}(\beta ) \Biggr\vert >\frac{{\lambda _{a2}}}{{\sqrt{{d_{g}}} }} \Biggr\rbrace \\ &\le 2d_{\mathrm{max}}G_{n} \exp \biggl( { - \frac{{nt^{2}w_{\min }^{2}}}{{8{L^{2}}{e^{4LB}}}}} \biggr). \end{aligned}$$
(5.19)

Let \(2d_{\mathrm{max}}G_{n}\exp ( { - \frac{{nt^{2}w_{\min }^{2}}}{{8{L^{2}}{e^{4LB}}}}} )= d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\), which gives

$$ t = \frac{{2\sqrt{2} AL{e^{2LB}}}}{{w_{\min }}}\sqrt{\frac{{\log (2{G_{n}})}}{n}}. $$
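
This value of t is obtained as follows (our own intermediate step, added for readability): cancelling the factor \(d_{\mathrm{max}}\) and dividing by \(2G_{n}\) in the defining equality gives

$$ {e^{ - \frac{nt^{2}w_{\min }^{2}}{8L^{2}{e^{4LB}}}}}=(2{G_{n}})^{-A^{2}} \quad \Longleftrightarrow \quad \frac{nt^{2}w_{\min }^{2}}{8L^{2}{e^{4LB}}}=A^{2}\log (2{G_{n}}) \quad \Longleftrightarrow \quad t = \frac{{2\sqrt{2} AL{e^{2LB}}}}{{w_{\min }}}\sqrt{\frac{{\log (2{G_{n}})}}{n}}. $$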

Finally, we have

$$\begin{aligned} {P}\bigl({\mathcal{A}}_{2}^{c}\bigr)\le d_{\mathrm{max}}(2G_{n})^{1-A^{2}} \end{aligned}$$
(5.20)

by letting \({\lambda _{a2}} = \frac{{2L\sqrt{2{d_{\max }}} }}{{{w_{\min }}}} ( {\sqrt{\frac{{\log (2p)}}{n}} + A{e^{2LB}}\sqrt{\frac{{\log (2{G_{n}})}}{n}} } )\). Together with (5.12), it gives

$$ P(\mathcal{A})=P({{\mathcal{A}}_{1}}\cap {{\mathcal{A}}_{2}}) \ge {P}({\mathcal{A}}_{1})+{P}({\mathcal{A}}_{2})-1\ge 1-2d_{\mathrm{max}}(2G_{n})^{1-A^{2}}. $$

Then (4.6) is obtained by using (4.5) conditioning on the event \({{\mathcal{A}}_{1}}\cap {{\mathcal{A}}_{2}}\).

5.2.2 Proof of Proposition 4.2

For the event \({{\mathcal{B}}_{0}}\), we need the exponential concentration inequality for the uniform convergence of the empirical distribution function

$$ F_{n}(x)={\frac{1}{n}}\sum_{{i=1}}^{n}{1} {{\{X_{i}\leq x\}}}, \quad x \in {\mathbb{R}}. $$

Lemma 5.5

(DKW inequality, Massart [19])

For \({x\in {\mathbb{R}} }\), the DKW inequality bounds the probability that the random function \(F_{n}(x)\) differs from \(F(x)\) by more than a given constant \(\varepsilon > 0\):

$$ P{ \Bigl(}\sup_{x\in {\mathbb{R}} } \bigl\vert F_{n}(x)-F(x) \bigr\vert >\varepsilon { \Bigr)}\leq 2e^{-2n\varepsilon ^{2}}. $$
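
A small Monte Carlo check of the DKW bound is given below; the uniform reference distribution, the sample size, and the threshold are illustrative assumptions only.

```python
# Monte Carlo check of the DKW inequality for Uniform(0,1) data:
# P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2).
import numpy as np

rng = np.random.default_rng(4)
n, eps, reps = 100, 0.1, 20_000

def ks_sup(x):
    """sup_x |F_n(x) - F(x)| for Uniform(0,1) data, computed from the sorted sample."""
    x = np.sort(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

exceed = sum(ks_sup(rng.uniform(size=n)) > eps for _ in range(reps)) / reps
print("empirical tail:", exceed)
print("DKW bound     :", 2 * np.exp(-2 * n * eps**2))
```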

Dvoretzky, Kiefer, and Wolfowitz [8] proved the inequality with an unspecified multiplicative constant in front of the exponential tail bound; Massart [19] showed that the DKW inequality holds with the sharp constant 2. Now set \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\), i.e., \(2U{{\mathrm{{e}}}^{ LB}}={p_{\tau }}:={{P({T_{1}} \ge \tau )}}\). We have

$$\begin{aligned} P\bigl({{\mathcal{B}}_{0}^{c}}\bigr)&= P \Bigl(\inf_{ {t_{s}} \in [0,\tau ], {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} } {B_{1n}}({t_{s}},\beta ) < U \Bigr) \\ &\le P \Biggl(\inf_{ {t_{s}} \in [0,\tau ], {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }\frac{1}{n}\sum _{j = 1}^{n} {1({T_{j}} \ge {t_{s}}){{\mathrm{{e}}}^{z_{j}^{\tau }({t_{s}})\beta }}} \le U \Biggr) \\ & \le P \Biggl(\frac{1}{n}\sum_{j = 1}^{n} {1({T_{j}} \ge \tau ){{\mathrm{{e}}}^{-LB}}} \le U \Biggr) \\ &= P \Biggl(\frac{1}{n} \sum_{j = 1}^{n} {1({T_{j}} \ge \tau )} - \mathrm{E}\bigl[1({T_{1}} \ge \tau )\bigr] \le U{{\mathrm{{e}}}^{ LB}} - P({T_{1}} \ge \tau ) \Biggr) \\ & \le P \Biggl(\Biggl|\frac{1}{n}\sum_{j = 1}^{n} {1({T_{j}} \ge \tau )} - P({T_{1}} \ge \tau )\Biggr| \ge \frac{p_{\tau }}{2} \Biggr) \\ & \le P \Biggl(\sup_{x \in {\mathbb{R}}} \Biggl| \frac{1}{n}\sum _{j = 1}^{n} {1({T_{j}} \ge x)} - \mathrm{E}\bigl[1({T_{1}} \ge x)\bigr]\Biggr| \ge \frac{p_{\tau }}{2} \Biggr) \le 2{e^{ - np_{\tau }^{2}/2}}. \end{aligned}$$
(5.21)

Let \((\mathcal{F},\|\cdot \|)\) be a subset of a normed space of real functions \(f: \mathcal{X} \rightarrow \mathbb{R}\) on some set \(\mathcal{X}\). Define the \(L_{r}(Q)\)-norm by \(\|f\|_{L_{r}(Q)}= (\int |f|^{r} \,d Q )^{1 / r}\). For a probability measure Q, we have the \(L_{r}(Q)\)-space endowed with the \(L_{r}(Q)\)-norm. Given two functions \(l(\cdot )\) and \(u(\cdot )\), the bracket \([l, u]\) is the set of all functions \(f \in \mathcal{F}\) with \(l(x) \leq f(x) \leq u(x)\) for all \(x \in \mathcal{X} \). An ε-bracket is a bracket \([l, u]\) with \(\|l-u\|_{L_{r}(Q)}<\varepsilon \), see van der Vaart and Wellner [25]. The bracketing number \(N_{[\,]} ({\varepsilon }, \mathcal{F}, L_{r}(Q) )\) is the minimum number of ε-brackets needed to cover \(\mathcal{F}\), i.e.,

$$ N_{[\,]} \bigl({\varepsilon }, \mathcal{F}, L_{r}(Q) \bigr)= \inf \Biggl\{ n: \exists l_{1}, u_{1}, \ldots , l_{n}, u_{n} \text{ s.t. } \bigcup _{i=1}^{n} [l_{i}, u_{i} ]\supseteq \mathcal{F} \text{ and } \Vert l_{i}-u_{i} \Vert _{L_{r}(Q)} < \varepsilon \text{ for all } i \Biggr\} . $$

For the event \({{\mathcal{B}}_{1}}\), let \(B_{i}^{g}(\beta ):= {{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ig}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} - {\mathrm{{E}}} [1(T \ge {t_{s}}) \frac{{{z_{ig}}(T)}{{\mathrm{{e}}}^{{z^{\tau }}(T) \beta }}}{{{w_{g}}}} ]\) and

$$ B_{ij}^{g}(\beta ):= {{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ij}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} - {\mathrm{{E}}} \biggl[1(T \ge {t_{s}}) \frac{{{z_{ij}}(T)}{{\mathrm{{e}}}^{{z^{\tau }}(T) \beta }}}{{{w_{g}}}} \biggr],\quad j=1,\ldots,d_{g};i=1, \ldots,n. $$

Similar to the analysis of \({{\mathcal{A}}_{1}}\) and \({{\mathcal{A}}_{2}}\), we have

$$\begin{aligned} {P}\bigl({\mathcal{B}}_{1}^{c}\bigr)&\le \sum_{g=1}^{G_{n}}{P} \Biggl\lbrace \Biggl\Vert \frac{1}{n}\sum_{i=1}^{n}B_{i}^{g}( \beta ) \Biggr\Vert _{2}^{2}> {\lambda _{b1}^{2}}U^{2} \Biggr\rbrace \\ &\le \sum_{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \Biggl\vert \frac{1}{n}\sum_{i=1}^{n}{{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ij}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} \\ &\quad{}- { \mathrm{{E}}}\biggl[1(T \ge {t_{s}}) \frac{{{z_{ij}}({t_{s}})}{{\mathrm{{e}}}^{{z^{\tau }}({t_{s}}) \beta }}}{{{w_{g}}}}\biggr] \Biggr\vert >\frac{{\lambda _{b1}U}}{{\sqrt{{d_{g}}} }} \Biggr\rbrace \\ &\le \sum_{g=1}^{G_{n}}\sum _{j=1}^{d_{g}}{P} \Biggl\lbrace \sup _{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert \frac{1}{n}\sum _{i=1}^{n}{{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ij}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} \\ &\quad {}- {\mathrm{{E}}}\biggl[1(T \ge {t_{s}}) \frac{{{z_{ij}}({t_{s}})}{{\mathrm{{e}}}^{{z^{\tau }}({t_{s}}) \beta }}}{{{w_{g}}}}\biggr] \Biggr\vert > \frac{{\lambda _{b1}U}}{{\sqrt{{d_{\max }}} }} \Biggr\rbrace . \end{aligned}$$
(5.22)

We then apply a concentration inequality for the suprema of empirical processes to the following events:

$$\begin{aligned} {\mathcal{B}}_{1gj}&= \Biggl\lbrace \sup_{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert \frac{1}{n}\sum_{i=1}^{n}{{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ij}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} - { \mathrm{{E}}} \biggl[1(T \ge {t_{s}}) \frac{{{z_{ij}}(T)}{{\mathrm{{e}}}^{{z^{\tau }}(T) \beta }}}{{{w_{g}}}} \biggr] \Biggr\vert \le \frac{{\lambda _{b1}U}}{{\sqrt{{d_{\max }}} }} \Biggr\rbrace ,\\ &\quad j=1,\ldots,d_{g};g=1, \ldots,{G_{n}}, \end{aligned}$$

whose associated function classes (introduced below) have bracketing numbers \(N_{[\,]} ({\varepsilon }, {\mathcal{F}}_{1gj}, L_{2}(P) )\) relative to the \(L_{2}(P)\)-norm, see Theorem 2.14.9 of van der Vaart and Wellner [25].

Lemma 5.6

(Sharper bounds for suprema of empirical processes, Talagrand [22])

Consider a probability space \((\Omega , \Sigma , P)\) and n i.i.d. random variables \(X_{1}, \ldots , X_{n}\), valued in Ω, of law P. Let \(\mathcal{F}\) be a class of measurable functions \(f: \Omega \mapsto [0,1]\) that satisfy

$$ N_{[\,]} \bigl(\varepsilon , \mathcal{F}, L_{2}(P) \bigr) \leq \biggl( \frac{K}{\varepsilon } \biggr)^{V} \quad \textit{for every } 0< \varepsilon < K. $$

Then, for every \(t>0\),

$$ P \Biggl(\sqrt{n} {\sup_{f \in {\mathcal{F}}} \Biggl\vert { \frac{1}{n}\sum_{i = 1}^{n} {f({X_{i}})} - {\mathrm{{E}}}f({X_{i}})} \Biggr\vert \ge t} \Biggr) \leq \biggl(\frac{D(K) t}{\sqrt{V}} \biggr)^{V} e^{-2 t^{2}} $$

for a constant \(D(K)\) that depends on K only.

The explicit constant \(D(K)\) can be found in Zhang [30], who studies tail bounds for the suprema of the empirical process \(\{ n^{-1 / 2} \sum_{i=1}^{n} [f (X_{i} )- \mathrm{E} f (X_{i} ) ] \} \), where \(\{X_{i}\}\) is a sequence of (possibly non-i.i.d., unbounded) independent random vectors with values in a general measurable space \((\mathcal{X}, \mathcal{A})\), and f is a measurable real function on \((\mathcal{X}, \mathcal{A})\).

In what follows, we assume that \(z(t)\) is non-random. For \(\{{\mathcal{B}}_{1gj}\}\) in (5.22), we have the function classes

$$ \begin{aligned} {\mathcal{F}}_{1gj} &= \biggl\{ f_{t,\beta }(x,z)= {{1} ( {x \ge t} ) \frac{[{{z_{1j}}(t){{\mathrm{{e}}}^{z^{\tau }({t})\beta }}}+L{{\mathrm{{e}}}^{LB}}]{w_{\min }}}{{2L{{\mathrm{{e}}}^{LB}}}{w_{g}}}:t \in [0,\tau ],\beta \in {\mathbb{R}^{p}}} \biggr\} ,\\ &\quad j=1,\ldots,d_{g};g=1,\ldots,{G_{n}}, \end{aligned} $$

so \(0\le f_{t,\beta }(x,z)\le 1\).

In \({\mathcal{B}}_{2}\), we focus on the class of functions \(0\le g_{t,\beta }(x,z)\le 1\),

$$ {\mathcal{G}}_{2} = \bigl\{ g_{t,\beta }(x,z)={{1} ( {x \ge t} ){ \mathrm{e} ^{z^{\tau }(t) \beta -LB }}:t \in [0, \tau ],\beta \in {\mathbb{R}^{p}}} \bigr\} . $$

Let \(\lceil x\rceil \) be the smallest integer that is greater than or equal to x. For any \(\varepsilon \in (0,1)\), let \(t_{s}\) be the \(s\varepsilon \)-quantile of \(T_{1}\); thus

$$ P (T_{1} \leq t_{s} )=s \varepsilon , \quad s=1, \ldots ,{\lceil 1 / \varepsilon \rceil }-1, \qquad t_{0}=0,\qquad t_{\lceil 1 / \varepsilon \rceil }=\infty . $$

For \({\mathcal{F}}_{1gj}\) and \({\mathcal{G}}_{2}\), we consider two types of brackets of the forms

$$\begin{aligned} &\bigl[L_{jg,k}^{\mathcal{F}}(x,z),U_{jg,k}^{\mathcal{F}}(x,z) \bigr]\\ &\quad : = \biggl[1 ( {x \ge {s_{k}}} ) \frac{({{z_{j}}{{\mathrm{{e}}}^{z^{\tau }\beta }}}+L{{\mathrm{{e}}}^{LB}}){w_{\min }}}{{2L{{\mathrm{{e}}}^{LB}}}{w_{g}}},1 ( {x \ge {s_{k - 1}}} ) \frac{({{z_{j}}{{\mathrm{{e}}}^{z^{\tau }\beta }}}+L{{\mathrm{{e}}}^{LB}}){w_{\min }}}{{2L{{\mathrm{{e}}}^{LB}}}{w_{g}}} \biggr], \\ &\qquad j=1,\ldots,d_{g};g=1,\ldots,{G_{n}};z=(z_{1}, \ldots , z_{p})^{\tau }:=(L, \ldots , L)^{\tau }\in \mathbb{R}^{p} \end{aligned}$$

and

$$ \bigl[L_{k}^{\mathcal{G}}(x,z),U_{k}^{\mathcal{G}}(x,z) \bigr]: = \biggl[1 ( {x \ge {s_{k}}} )\frac{{{e^{z^{\tau }\beta }}}}{{{e^{LB}}}},1 ( {x \ge {s_{k - 1}}} )\frac{{{e^{z^{\tau }\beta }}}}{{{e^{LB}}}} \biggr] $$

for a grid of points \(-\infty =s_{0}< s_{1}<\cdots <s_{\lceil 1 / \varepsilon \rceil }= \infty \) with the property \(F (s_{k} )-F (s_{k-1} )<\varepsilon \) for all k.

Then, for given j and g, the bracket functions satisfy

$$\begin{aligned}& L_{jg,k}^{\mathcal{F}}(x,z) \le f_{s,\beta }(x,z)\le U_{jg,k}^{\mathcal{F}}(x,z),\quad k=0,1,2,\ldots \\& L_{k}^{\mathcal{G}}(x,z) \le g_{s,\beta }(x,z)\le U_{k}^{\mathcal{G}}(x,z),\quad k=0,1,2,\ldots \end{aligned}$$

provided \(s_{k-1}< s \leq s_{k}\).

For \(\{{\mathcal{B}}_{1gj}\}\), the \(L_{2}(P)\)-norm of \(U_{jg,k}^{\mathcal{F}}(x,z)-L_{jg,k}^{\mathcal{F}}(x,z)\) is

$$\begin{aligned} &\bigl\Vert U_{jg,k}^{\mathcal{F}}(x,z)-L_{jg,k}^{\mathcal{F}}(x,z) \bigr\Vert _{L_{2}(P)}\\ &\quad =\bigl\{ {\mathrm{{E}}}_{T} \bigl[U_{jg,k}^{\mathcal{F}}(T,z)-L_{jg,k}^{\mathcal{F}}(T,z) \bigr]^{2}\bigr\} ^{1 / 2} \\ &\quad \le \biggl\{ {\mathrm{{E}}}_{T} \biggl[ \frac{({{z_{j}}{{\mathrm{{e}}}^{z^{\tau }\beta }}}+L{{\mathrm{{e}}}^{LB}}){w_{\min }}}{{2L{{\mathrm{{e}}}^{LB}}}{w_{g}}} \bigl\{ 1 (T \geq s_{k-1} )-1 (T \geq s_{k} ) \bigr\} \biggr]^{2} \biggr\} ^{1 / 2} \\ &\quad \le \bigl\{ P (s_{k-1}< T\leq s_{k} ) \bigr\} ^{1 / 2}= \bigl\{ F (s_{k} )-F (s_{k-1} ) \bigr\} ^{1 / 2}< \sqrt{\varepsilon }. \end{aligned}$$

For \({\mathcal{B}}_{2}\), the \(L_{2}(P)\)-norm for \(U_{k}^{\mathcal{G}}(x,z)-L_{k}^{\mathcal{G}}(x,z)\) is

$$\begin{aligned} \bigl\Vert U_{k}^{\mathcal{G}}(x,z)-L_{k}^{\mathcal{G}}(x,z) \bigr\Vert _{L_{2}(P)}& =\bigl\{ {\mathrm{{E}}}_{T} \bigl[U_{k}^{\mathcal{G}}(T,z)-L_{k}^{\mathcal{G}}(T,z) \bigr]^{2}\bigr\} ^{1 / 2} \\ &\le \biggl\{ {\mathrm{{E}}}_{T} \biggl[ \frac{{e^{z^{\tau }\beta }}}{{e^{LB}}} \bigl\{ 1 (T \geq s_{k-1} )-1 (T \geq s_{k} ) \bigr\} \biggr]^{2} \biggr\} ^{1 / 2} \\ & \le \bigl\{ P (s_{k-1}< T \leq s_{k} ) \bigr\} ^{1 / 2}= \bigl\{ F (s_{k} )-F (s_{k-1} ) \bigr\} ^{1 / 2}< \sqrt{\varepsilon }. \end{aligned}$$

In both cases, by the definition of bracketing number, we get

$$ N_{[\,]} \bigl(\sqrt{\varepsilon }, \mathcal{F}, L_{2}(P) \bigr)\le \lceil 1 / \varepsilon \rceil \le 2 / \varepsilon . $$

Hence, \(N_{[\,]} ({\varepsilon }, \mathcal{F}, L_{2}(P) )\le 2 / \varepsilon ^{2}\).

For the event \({{\mathcal{B}}_{1}}\) with relation (5.22), we get \(K=\sqrt{2}\) and \(V=2\) in Lemma 5.6. Then, conditioning on the design z, Lemma 5.6 gives

$$\begin{aligned} & P\bigl({\mathcal{B}}_{1gj}^{c}\bigr)=P \Biggl\lbrace \sup_{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert {\frac{1}{n}\sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{{z_{ij}}({t_{s}})}}{{{w_{g}}}}{{ \mathrm{{e}}}^{z_{i}^{\tau }({t_{s}}) \beta }}} - {{\mathrm{{E}}}_{T}}\biggl[1(T \ge {t_{s}}) \frac{{{z_{ij}}(T)}}{{{w_{g}}}}{{ \mathrm{{e}}}^{z^{\tau }(T) \beta }}\biggr]} \Biggr\vert > \frac{{2L{e^{LB}}t}}{{{w_{\min }}}} \Biggr\rbrace \\ &\quad ={{\mathrm{{E}}}_{z}}P \Biggl\lbrace \sup_{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert \frac{1}{n}\sum_{i = 1}^{n} {\frac{{{1} ( {{T_{i}} \ge {t_{s}}} ){z_{ij}}({t_{s}})[{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }} + L{e^{LB}}]{w_{\min }}}}{{2L{e^{LB}}{w_{g}}}}} \\ &\qquad {}- {{\mathrm{{E}}}_{T}}\biggl[ \frac{{1(T \ge {t_{s}}){z_{ij}}(T)[{{\mathrm{{e}}}^{z^{\tau }(T)\beta }} + L{e^{LB}}]{w_{\min }}}}{{2L{e^{LB}}{w_{g}}}}\biggr] \Biggr\vert > t\Big| {z} \Biggr\rbrace \\ &\quad \le {\mathrm{{E}}}_{z}\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}= \frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}. \end{aligned}$$

Note that \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\), thus we put \(\frac{{2L{e^{LB}}t}}{{{w_{\min }}}}= \frac{{\lambda _{b1}U}}{{\sqrt{{d_{\max }}} }}= \frac{{\lambda _{b1}{p_{\tau }}{{\mathrm{{e}}}^{ - LB}}}}{2{\sqrt{{d_{\max }}} }}\) in (5.22), which implies

$$\begin{aligned} {P}\bigl({\mathcal{B}}_{1}^{c}\bigr)\le \sum _{g=1}^{G_{n}}\sum_{j=1}^{d_{g}}P\bigl({ \mathcal{B}}_{1gj}^{c}\bigr)\le d_{\mathrm{max}}G_{n} \frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}} \end{aligned}$$

with \(t = \frac{{{\lambda _{b1}}{p_{\tau }}{{\mathrm{{e}}}^{ - 2LB}}{w_{\min }}}}{{4L\sqrt{{d_{\max }}} }}\).

Let \(d_{\mathrm{max}}G_{n}\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}=d_{\mathrm{max}}G_{n}\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} (G_{n})^{-A^{2}}\), which gives \(t = \frac{A}{{\sqrt{2} }}\sqrt{\frac{{\log ({G_{n}})}}{n}}\). Then we have

$$\begin{aligned} {P}\bigl({\mathcal{B}}_{1}^{c}\bigr)\le \frac{{{d_{{\mathrm{{max}}}}}{G_{n}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}}{({G_{n}})^{1 - {A^{2}}}} \end{aligned}$$
(5.23)

with the tuning parameter \({\lambda _{b1}}\) determined by

$$ {\lambda _{b1}} = \frac{{4tL\sqrt{{d_{\max }}} }}{{{p_{\tau }}{{\mathrm{{e}}}^{ - 2LB}}{w_{\min }}}} = \frac{{2\sqrt{2} LA{{\mathrm{{e}}}^{2LB}}\sqrt{{d_{\max }}} }}{{{p_{\tau }}{w_{\min }}}} \sqrt{ \frac{{\log ({G_{n}})}}{n}}. $$

For the event \({{\mathcal{B}}_{2}}\), we also have \(K=\sqrt{2}\) and \(V=2\) in Lemma 5.6. We obtain

$$\begin{aligned} P\bigl({\mathcal{B}}_{2}^{c}\bigr) &=\mathrm{E}_{z} P \Biggl\lbrace \sup_{\substack{ {t_{s}} \in [0,\tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert {\frac{1}{n} \sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}) \beta }} - {{\mathrm{{E}}}_{T}} \bigl[1(T \ge {t_{s}}){\mathrm{e} ^{z^{\tau }(T) \beta }}\bigr]} } \Biggr\vert > {{e^{LB}}t}\Big| {z} \Biggr\rbrace \\ &=\mathrm{E}_{z} P \Biggl\lbrace \sup_{\substack{ {t_{s}} \in [0, \tau ], \\ {\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})} }} \Biggl\vert {\frac{1}{n} \sum_{i = 1}^{n} {{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}}) \beta -LB}} - {{\mathrm{{E}}}_{T}} \bigl[1(T \ge {t_{s}}){{{\mathrm{e}} } ^{z^{\tau }(T) \beta -LB}}\bigr]} } \Biggr\vert > t\Big| {z} \Biggr\rbrace \\ &\le \frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}. \end{aligned}$$

Note that \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\), thus we set \({{e^{LB}}t}={{\lambda _{b2}U}}= \frac{{\lambda _{b2}{p_{\tau }}{{\mathrm{{e}}}^{ - LB}}}}{2}\) in (4.11). It gives \({P}({\mathcal{B}}_{2}^{c})\le \frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}\) with \(t = \frac{{\lambda _{b2}{p_{\tau }}{{\mathrm{{e}}}^{ - 2LB}}}}{2}\).

Assign \(\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}=\frac{D^{2}(\sqrt {2}) t^{2}}{{2}}p^{-A^{2}}\), it implies \(t = \frac{A}{{\sqrt{2} }}\sqrt{\frac{{\log p}}{n}}\). Therefore, the tuning parameter \({\lambda _{b2}}\) is determined by

$$ {\lambda _{b2}} = \frac{{\sqrt{2} A{{\mathrm{{e}}}^{2LB}}}}{{{p_{\tau }}}} \sqrt{\frac{{\log p}}{n}} $$

such that

$$\begin{aligned} P\bigl({\mathcal{B}}_{2}^{c}\bigr) \le \frac{{{D^{2}}(\sqrt {2}){t^{2}}}}{2}{e^{ - 2n{t^{2}}}} = \frac{{{D^{2}}(\sqrt{2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}. \end{aligned}$$
(5.24)

Finally, combining (5.21), (5.23), and (5.24), we obtain

$$ \begin{aligned} P({\mathcal{B}})&\ge P({{\mathcal{B}}_{0}}) + P({{ \mathcal{B}}_{1}}) + P({{\mathcal{B}}_{2}}) - 2\\ &\ge 1-2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{G_{n}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}}{G_{n}^{1 - {A^{2}}}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}. \end{aligned} $$

6 Conclusions and future study

In this paper, we focus on survival analysis via proportional hazards regression in the setting where both the number of covariates p and the sample size n grow, with \(p\gg n\). When \(p>n\), the classical partial likelihood estimator is over-parameterized, and Lasso or weighted group Lasso regularization is needed to obtain a stable and satisfactory fit of the proportional hazards regression. Under the group stabil condition, sharp oracle inequalities for weighted group Lasso regularized misspecified Cox models are derived. The upper bound on the \(\ell _{2,1}\)-estimation error is determined by the tuning parameter, with rate \(O ( \sqrt{ \frac{\log p}{n}} )+O ( \sqrt{\frac{\log (G_{n})}{n}} )\). The obtained nonasymptotic oracle inequalities imply that the penalized estimator is consistent when \(\log p /n \to 0\) under mild conditions, and the rate is optimal in the minimax sense.

Statistical inference for the penalized estimator (confidence intervals, hypothesis tests for the coefficients, and FDR control) is left for future study.

Availability of data and materials

This is a purely mathematical paper. Data analysis is not applicable.


Acknowledgements

Ting Yan (tingyanty@mail.ccnu.edu.cn) and Huiming Zhang (huimingzhang@um.edu.mo) are co-corresponding authors. The authors are listed in alphabetical order and contributed equally to this work. We would like to thank the two reviewers for taking the time to read our paper and for providing excellent suggestions and comments. The first author would like to express sincere gratitude to the advisor Prof. Jinzhu Jia for his guidance on high-dimensional statistics. The authors also thank Prof. Hui Zhao for helpful discussions.

Funding

Ting Yan is partially supported by the National Natural Science Foundation of China (No. 11771171) and the Fundamental Research Funds for the Central Universities. Huiming Zhang is supported in part by the University of Macau under the UM Macao Talent Programme (UMMTP-2020-01).

Author information

Contributions

The authors completed the paper and approved the final manuscript.

Corresponding authors

Correspondence to Ting Yan or Huiming Zhang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Xiao, Y., Yan, T., Zhang, H. et al. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J Inequal Appl 2020, 252 (2020). https://doi.org/10.1186/s13660-020-02517-3


Keywords