A reproducing kernel Hilbert space approach to high dimensional partially varying coefficient model

https://doi.org/10.1016/j.csda.2020.107039

Abstract

Partially varying coefficient models (PVCM) provide a useful class of tools for modeling complex data by incorporating a combination of constant and time-varying covariate effects. One natural question is how to decide which covariates correspond to constant coefficients and which correspond to time-dependent coefficient functions. To handle this two-type structure selection problem for PVCM, existing methods rely either on a finite truncation of the coefficient functions or on a two-phase procedure that estimates the constant and functional parts separately. This paper attempts to provide a complete theoretical characterization of the estimation and structure selection issues of PVCM, by proposing two new penalized methods for PVCM within a reproducing kernel Hilbert space (RKHS). The proposed strategy is partially motivated by the so-called "Non-Constant Theorem" for radial kernels, which ensures a unique and unified representation of each candidate component in the hypothesis space. Within a high-dimensional framework, minimax convergence rates for the prediction risk of the first method are established when each unknown time-dependent coefficient can be well approximated within a specified RKHS. On the other hand, under certain regularity conditions, it is shown that the second proposed estimator is able to identify the underlying structure correctly with high probability. Several simulated experiments are conducted to examine the finite-sample performance of the proposed methods.

Introduction

Varying coefficient models are a useful class of tools for studying the time-dependent effects of variables. They provide a more flexible approach than classical linear models and have particular advantages in analyzing dynamic effects of covariates on responses measured repeatedly.

Suppose that $\{(y_i, t_i, x_i)\}_{i=1}^{n}$ is an independent and identically distributed (i.i.d.) random sample drawn from the varying coefficient model $$y = \sum_{j=1}^{p} f_j^o(t)\, x_j + \varepsilon, \tag{1.1}$$ where $x = (x_1, \ldots, x_p)^T \in [-1,1]^p$ is the $p$-dimensional covariate vector, each $f_j^o$ is an unknown coefficient function, and the noise term $\varepsilon$ is a standard normal variable. Here $y \in \mathbb{R}$ is the response of interest and $t \in [0,1]$ is the univariate index variable; typically, $t$ refers to time or age in applications. We assume that $x$ and $t$ are independent throughout our analysis.

The standard varying coefficient models (Fan and Zhang, 1999, Huang et al., 2002, Yang et al., 2006) assume that all time-dependent covariate effects have the same degree of smoothness and are spatially homogeneous. Nevertheless, such an assumption for the model (1.1) may lead to overfitting and a lack of interpretability when some covariate effects are in fact time-invariant, or when some coefficients are irrelevant to the response. This partially motivates us to investigate the PVCM in the high-dimensional setting. Under the standard structural sparsity setting, one often assumes that only $s_0$ out of the $p$ functions $f_j^o$ are non-zero constants independent of the index $t$, that $s_1$ functions are time-varying, and that the remaining $(p - s_1 - s_0)$ functions are identically zero. In the high-dimensional setting, of primary interest is the case where the additive form in (1.1) is fairly sparse, so that $s_0 + s_1 \ll p$. Indeed, this assumption can be relaxed to the general setting where the true functions can merely be well approximated by the candidate sparse function space. Our main focus here is on the situation where $p$ is larger than the sample size $n$.
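In symbols, one way to formalize this structure (a restatement of the description above, anticipating the constant/function split $\hat{a}_j$, $\hat{g}_j$ that reappears in Sections 4 and 7) is to write each coefficient as $f_j^o(t) = a_j + g_j(t)$ and set $$S_0 = \{j : a_j \neq 0,\ g_j \equiv 0\}, \qquad S_1 = \{j : g_j \not\equiv 0\}, \qquad |S_0| = s_0, \quad |S_1| = s_1,$$ with $f_j^o \equiv 0$ for every $j \notin S_0 \cup S_1$ and $s_0 + s_1 \ll p$.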

In this paper, we primarily investigate the estimation of the model coefficients and the identification of the model structure for the model (1.1) in the high-dimensional setting. For this purpose, we present two penalized approaches based on different mixed $\ell_1$ penalties within a RKHS. The first proposed method achieves minimax rates in terms of prediction ability, and the second one, with a simpler penalty, ensures structure selection consistency with high probability; these results are stated in Theorems 1, 2, and 3.

The problem, known variously as model selection, pattern recovery, or structure identification, arises in a broad variety of contexts (Donoho, 2006, Miller, 1990, Koller and Friedman, 2009), and has been studied for varying coefficient models in the existing literature. For example, Wang et al. (2007) and Wang and Xia (2008) use the group Lasso and SCAD methods for estimation and model selection in varying coefficient models with a fixed number of coefficients and covariates. In recent years, several works have addressed high-dimensional varying coefficient models, commonly using screening procedures and penalized methods, such as Song et al. (2014), Fan et al. (2014), and Liu et al. (2014), among others. We note that these works treat all relevant coefficients as functional and do not consider the problem of structure identification, including constant coefficients, mentioned above.

Usually, there is little (if any) prior knowledge of which coefficients are indeed constant and which are time-varying functions. In addition to the comments above, treating constant coefficients as time-varying results in a loss in the convergence rate. Hence, identifying multiple structures automatically from data is an interesting issue for statistical inference. Several recent papers have considered such multiple-structure problems in high-dimensional varying coefficient models. For example, Cheng et al. (2014) first adopted a screening procedure to reduce the number of covariates to a moderate order, and then identified both the varying and constant coefficients using a group SCAD-based approach. However, a first-step screening technique is often unable to identify a group-relevant feature subset when each individual feature is only weakly related to the response; this further degrades the quality of model estimation and structure identification. Using a quadratic approximation of the local log-likelihood function and the adaptive group Lasso, Li et al. (2015) introduced a penalized weighted least squares procedure to select the relevant covariates and then identify the constant coefficients among the coefficients of the selected covariates. Nevertheless, the method of Li et al. (2015) is somewhat inefficient, since its first step does not use the information about constant coefficients. Klopp and Pensky (2015) derived minimax lower bounds within certain frameworks for high-dimensional PVCM. In addition, based on a finite basis approximation, they constructed an adaptive estimator for the PVCM with the group Lasso penalty, which achieves those lower bounds under some regularity conditions.

In the current article, we also consider the PVCM when the solution is sparse, with some of the relevant coefficients constant and the rest time-dependent. We find that a key problem in estimating the PVCM (1.1) is the unique representation of each candidate estimator; otherwise, it is hard to identify constant terms from the derived estimators. Fortunately, Corollary 4.44 in [33] proves the "no constants" theorem for Gaussian RKHSs, which says that a Gaussian RKHS does not contain any non-zero constant function under mild conditions. This conclusion has been extended to general radial kernels (Scovel et al., 2010). Based on this conclusion, and considering that kernel-based approaches are developed in a unified mathematical framework and are well justified in theory, we propose two new penalized approaches to the PVCM within a RKHS. To be precise, our first proposed approach is based on a mixed penalty term: the standard $\ell_1$-norm penalty for selecting sparse vector coefficients, and a combination of an empirical $L_2$-norm and the RKHS-norm for selecting sparse function-type coefficients. The use of the empirical $L_2$-norm in our first method enables us to derive minimax rates under mild conditions, inspired in part by previous works such as Koltchinskii and Yuan (2010), Raskutti et al. (2012), and Suzuki and Sugiyama (2013), among others. On the other hand, to ensure structure selection consistency, it suffices to use only the RKHS-norm in our second method.
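Schematically (a hedged sketch consistent with the description above; the precise programs, weights, and norms are defined in Section 2 as (2.2)–(2.3)), the first estimator solves a problem of the form $$(\hat{a}, \hat{g}) \in \arg\min_{a \in \mathbb{R}^p,\ g_j \in \mathcal{H}_K}\ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\big(a_j + g_j(t_i)\big)x_{ij}\Big)^2 + \lambda_1 \|a\|_1 + \lambda_2 \sum_{j=1}^{p}\big(\|g_j\|_n + \lambda_3 \|g_j\|_K\big),$$ where the combined functional penalty corresponds to the mixed norm $\|\cdot\|_{K,\lambda_3}$ appearing in Section 7, while the second estimator retains only the RKHS-norm $\|g_j\|_K$ in the functional part.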

Compared with the existing approaches mentioned above, our methods are formulated in a single step. Moreover, our infinite-dimensional estimation approach, defined on RKHSs, avoids the estimation bias that often results from the standard finite truncation of basis expansions such as splines. The main goal of this paper is to provide a complete theoretical characterization of the estimation and structure selection issues of PVCM, by proposing two new penalized methods for PVCM within a RKHS.

Our main contribution for the first proposed method is to establish upper bounds on the prediction error in the $L_2$ and empirical $\|\cdot\|_n$ norms, under two different kinds of conditions on the covariates. More precisely, Theorem 1 provides upper bounds on the prediction error under very general conditions, without any "correlation" constraint on the covariates. We also show that our derived rates are nearly optimal under certain frameworks. If additional mutual incoherence conditions are satisfied by the covariates, such as the restricted eigenvalue condition (Bickel et al., 2009), we provide sharp upper bounds on the prediction error of the proposed estimator. Moreover, with appropriate choices of the regularization parameters, our upper bounds are shown to be optimal in the literature on the PVCM. In addition, our estimator is adaptive to the degree of sparsity.
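For concreteness (our reading of these two metrics, with the exact definitions deferred to Section 3), the population and empirical prediction errors of an estimate $\hat{f} = (\hat{f}_1, \ldots, \hat{f}_p)^T$ take the forms $$\|\hat{f} - f^o\|_2^2 = \mathbb{E}\Big[\big((\hat{f}(t) - f^o(t))^T x\big)^2\Big], \qquad \|\hat{f} - f^o\|_n^2 = \frac{1}{n}\sum_{i=1}^{n}\big((\hat{f}(t_i) - f^o(t_i))^T x_i\big)^2,$$ the latter matching the quantity $\|\hat{\Delta}^T x\|_n^2$ controlled in Section 7.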

Our primary contribution for the second proposed method is to prove structure selection consistency in the high-dimensional setting. Under invertibility conditions (Assumption F) and strict incoherence conditions (Assumption G), our second penalized approach achieves model selection consistency in the high-dimensional setting. To our knowledge, our results on structure selection consistency via the standard Lasso are new in the framework of high-dimensional PVCM. In general, estimation consistency does not guarantee exact recovery of the underlying structure pattern, and studying model selection consistency requires more stringent conditions, such as the "irrepresentable" condition (Zhao and Yu, 2006). In the infinite-dimensional case, greater technical challenges are encountered than in analyzing finite-dimensional problems. It is also worth noting that our analysis combines the finite-dimensional vector-valued coefficients with the infinite-dimensional function-valued coefficients in a theoretically unified way, which differs significantly from existing work (Cheng et al., 2014, Li et al., 2015).

Finally, our general bounds also yield useful convergence rates for two fundamental statistical models of interest: linear regression and classical varying coefficient models. In particular, our theoretical results shed light on the connections and differences between these models. Also, the derived conclusion on structure selection consistency can be viewed as an extension of results for sparse linear models (Wainwright, 2009) and sparse varying coefficient models.

This work mainly contributes to the statistical theory of parsimonious semiparametric varying coefficient models by providing both estimation rates and model selection consistency. We also verify the empirical performance of our methods through numerical studies. The rest of the paper is organized as follows. In Section 2, we provide background on kernel spaces and some useful properties of RKHSs for establishing our penalized methods. Section 3 is devoted to the statement of our main results and a discussion of their consequences; it contains the upper bounds on the prediction error under different settings. Structure selection consistency of our second estimator is proved in Section 4, and several simulation results for the first proposed method are shown in Section 5. We present a real data application in Section 6. In Section 7, we provide an explicit proof of Theorem 2; the other technical details are deferred to the Appendices.

Notation

For a vector $v \in \mathbb{R}^m$, $\|v\|_2 = \langle v, v\rangle^{1/2}$ is the standard $\ell_2$-norm in the Euclidean space. For a matrix $A$, we define $\|A\|_2 = \sup_{\|v\|_2 = 1}\|Av\|_2$ as the spectral norm of the matrix. For a function $f$, define the square-integral norm by $\|f\|_2^2 = \int_{\mathcal{X}} f(x)^2 \, d\rho(x)$ associated with some distribution $\rho$. For any vector function $q(t) = (q_1(t), \ldots, q_p(t))^T$, denote $\|q\|_2 = \big(\sum_{j=1}^{p}\|q_j\|_2^2\big)^{1/2}$. Besides, $a \asymp b$ means that there are two positive constants $c, C$ such that $ca \le b \le Ca$. Also, $a = O(b)$ means that there is some positive constant $C$ such that $a \le Cb$, and $a = o(b)$ means that $a/b \to 0$. We refer to $C$ with various subscripts as universal positive constants, which may differ from line to line. For shorthand, we denote $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We write $[p] = \{1, 2, \ldots, p\}$ for ease of notation.
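As a quick illustration of these norms (a minimal sketch of ours; the Monte Carlo average standing in for the integral against $\rho$ is an assumption of the sketch, not part of the definitions):

import numpy as np

def l2_norm(v):
    return np.sqrt(np.dot(v, v))          # ||v||_2 = <v, v>^{1/2}

def spectral_norm(A):
    return np.linalg.norm(A, 2)           # ||A||_2 = sup_{||v||_2 = 1} ||Av||_2

def empirical_sq_norm(f, x_sample):
    # approximates ||f||_2^2 = E_rho[f(x)^2] by a sample average
    return np.mean(f(x_sample) ** 2)

For instance, l2_norm(np.array([3.0, 4.0])) returns 5.0.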

Section snippets

Background and estimation strategy

We begin with a brief review of some useful facts about RKHSs that will be used in the sequel, before describing a regularized M-estimator with a mixed $\ell_1$ penalty for the sparse PVCM.
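As a concrete illustration (a minimal sketch, not the paper's estimation code; the bandwidth value is an arbitrary assumption), a Gaussian radial kernel on the index variable and the representer-theorem form of an RKHS element look as follows:

import numpy as np

# By the "no constants" theorem cited in the Introduction, the RKHS of a
# Gaussian (radial) kernel contains no non-zero constant function, so a
# constant effect a_j and a functional effect g_j are represented uniquely.
def gaussian_kernel(s, t, bandwidth=0.2):
    return np.exp(-(s - t) ** 2 / (2.0 * bandwidth ** 2))

# By the representer theorem, candidate functions take the finite form
# g(t) = sum_i c_i K(t_i, t) over the observed index points t_i.
def rkhs_function(coefs, t_obs, t, bandwidth=0.2):
    return sum(c * gaussian_kernel(ti, t, bandwidth)
               for c, ti in zip(coefs, t_obs))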

Main results and consequences

This section is devoted to the prediction risk of the Lasso estimator defined by the program (2.3) under two different kinds of frameworks, and to a discussion of some of their consequences. First, we present some new upper bounds for the prediction risk of the Lasso under very weak conditions on the covariates and the true coefficients. Then, Section 3.2 is devoted to refinements of the prediction error with fast rates based on mutual incoherence conditions. More importantly, we also

Structure selection consistency

In this section, we are concerned with the problem of structure selection consistency for the PVCM (1.1). For notational simplicity, we do not consider the approximation bias in what follows. More precisely, let $\hat{S}_0$, $\hat{S}_1$, and $\hat{S}_c$ be the index sets generated by the proposed estimator (2.3) for non-zero constant effects, time-varying effects, and null effects, respectively. Namely, $$\hat{S}_0 = \{j : \hat{a}_j \neq 0,\ \hat{g}_j = 0\}, \qquad \hat{S}_1 = \{j : \hat{g}_j \neq 0\}, \qquad \hat{S}_c = \{j : \hat{a}_j = 0,\ \hat{g}_j = 0\}.$$ Under certain regularity conditions, we will show that our
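In code, sorting the fitted components into these three index sets could look as follows (a hypothetical helper of ours: a_hat holds the fitted constants, g_hat_norms some norm of each fitted function, e.g. its empirical norm, and a small tolerance replaces the exact zeros in the definitions above):

def classify_structure(a_hat, g_hat_norms, tol=1e-8):
    S0, S1, Sc = [], [], []
    for j, (a, g) in enumerate(zip(a_hat, g_hat_norms)):
        if g > tol:
            S1.append(j)        # time-varying effect: g_j != 0
        elif abs(a) > tol:
            S0.append(j)        # non-zero constant effect: a_j != 0, g_j = 0
        else:
            Sc.append(j)        # null effect: a_j = 0, g_j = 0
    return S0, S1, Sc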

Simulations

This section presents simulation results illustrating the numerical performance of the first proposed method.

We conducted Monte Carlo studies based on the following generating model: $$y_i = \sum_{j=1}^{p} g_j(t_i)\, x_{ij} + \varepsilon_i,$$ with $g_1(t) = 4\sin(2\pi t)$, $g_2(t) = 15t(1-t)$, and $g_3(t) = g_4(t) = 2$. Thus, in our generating model, the number of non-zero constants is $s_0 = 2$ and the number of time-varying coefficients is $s_1 = 2$. Several simulation scenarios are considered. We set $n = 100$ or $200$, and $p = 50$ or $100$. Although here $p$ is not very large,
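A minimal sketch of this generating process (the uniform draws for $t_i$ and $x_i$ and the fixed seed are our assumptions; the text only specifies the coefficient functions, $x \in [-1,1]^p$, $t \in [0,1]$, and standard normal noise):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50                            # one of the stated (n, p) scenarios

t = rng.uniform(0.0, 1.0, n)              # index variable
x = rng.uniform(-1.0, 1.0, (n, p))        # covariates
eps = rng.standard_normal(n)              # standard normal noise

coef = np.zeros((n, p))
coef[:, 0] = 4.0 * np.sin(2.0 * np.pi * t)    # g1: time-varying
coef[:, 1] = 15.0 * t * (1.0 - t)             # g2: time-varying
coef[:, 2] = coef[:, 3] = 2.0                 # g3 = g4 = 2: constant effects
y = (coef * x).sum(axis=1) + eps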

Genome wide association study

In this section, we present some numerical results of a genome-wide association study for our method. The sheep growth and meat production traits data are taken from Zhang et al. (2013) and can be found at https://www.ncbi.nlm.nih.gov/gds under series number GSE46231.

The dataset contains 54,241 Single Nucleotide Polymorphisms (SNPs), and the three genotypes of each SNP are coded as the 2-dimensional vectors $(1,0)^T$, $(0,1)^T$, and $(1,1)^T$. Furthermore, we perform a feature screening (Fan et al.,
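In code, the stated genotype coding might look as follows (the 0/1/2 genotype labels and the helper name are hypothetical; only the three 2-dimensional code vectors come from the text):

import numpy as np

GENOTYPE_CODE = {0: (1, 0), 1: (0, 1), 2: (1, 1)}

def encode_snp(genotypes):
    # maps an array of per-individual genotypes to an (n, 2) design block
    return np.array([GENOTYPE_CODE[g] for g in genotypes])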

Proof for Theorem 2

We begin by establishing an inequality on the error function $\hat{\Delta} = \hat{f} - f$ and $\tilde{\delta}(t,x) = (f^o(t) - f(t))^T x$. Since $\hat{f}$ and $f$ are feasible for the proposed program (2.2), it follows from the definition of $\hat{f}$ that $L(\hat{a}, \hat{g}) \le L(a, g)$. Hence the error function satisfies the bound $$\frac{1}{n}\sum_{i=1}^{n}\big(\hat{\Delta}(t_i)^T x_i + \tilde{\delta}(t,x) - \varepsilon_i\big)^2 + \lambda_1 \|\hat{a}\|_1 + \lambda_2 \|\hat{g}\|_{K,\lambda_3} \le \frac{1}{n}\sum_{i=1}^{n}\big(\tilde{\delta}(t,x) - \varepsilon_i\big)^2 + \lambda_1 \|a\|_1 + \lambda_2 \|g\|_{K,\lambda_3}.$$ It is easily verified that $\|\cdot\|_{K,\lambda_3}$ is a mixed norm for any given $\lambda_3 > 0$, and after some simple algebra this yields $\|\hat{\Delta}^T x\|_n^2 \le \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i[\hat{\Delta}(t$
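For the reader's convenience, the "simple algebra" here is the standard basic-inequality expansion; in the bias-free simplification $\tilde{\delta} \equiv 0$ adopted in Section 4 (our choice, made to keep the sketch unambiguous) it reads $$\frac{1}{n}\sum_{i=1}^{n}\big(\hat{\Delta}(t_i)^T x_i - \varepsilon_i\big)^2 + \lambda_1\|\hat{a}\|_1 + \lambda_2\|\hat{g}\|_{K,\lambda_3} \le \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 + \lambda_1\|a\|_1 + \lambda_2\|g\|_{K,\lambda_3},$$ and expanding the square and cancelling $\frac{1}{n}\sum_{i}\varepsilon_i^2$ on both sides gives $$\|\hat{\Delta}^T x\|_n^2 \le \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\,\hat{\Delta}(t_i)^T x_i + \lambda_1\big(\|a\|_1 - \|\hat{a}\|_1\big) + \lambda_2\big(\|g\|_{K,\lambda_3} - \|\hat{g}\|_{K,\lambda_3}\big).$$ The remaining work, carried out in the full proof, is to control the cross term involving $\varepsilon_i$.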

Acknowledgments

We are grateful to three referees and the associate editor for valuable comments and constructive suggestions. The first author's research is supported partially by the National Natural Science Foundation of China (Grant Nos. 11871277 and 11829101). TS was partially supported by JSPS KAKENHI (18H03201, 18K19793, and 20H00576), Japan Digital Design, and JST-CREST.

References (46)

  • Scovel, C., et al. Radial kernels and their reproducing kernel Hilbert spaces. J. Complexity (2010)
  • Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc. (1950)
  • Bach, F.R. Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res. (2008)
  • Bach, F., et al. Convex optimization with sparsity-inducing norms
  • Bartlett, P.L., et al. Local Rademacher complexities. Ann. Statist. (2005)
  • Bartlett, P.L., et al. Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. (2002)
  • Bickel, P.J., et al. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. (2009)
  • Bousquet, O. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris (2002)
  • Boyd, S., et al. Convex Optimization (2004)
  • Buldygin, V.V., et al. Metric Characterization of Random Variables and Random Processes (2000)
  • Cheng, M., et al. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Statist. (2014)
  • Cucker, F., et al. Learning Theory: An Approximation Theory Viewpoint (2007)
  • Donoho, D. Compressed sensing. IEEE Trans. Inf. Theory (2006)
  • Fan, J., et al. Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. J. Amer. Statist. Assoc. (2014)
  • Fan, J., et al. Statistical estimation in varying coefficient models. Ann. Statist. (1999)
  • Fukumizu, K., et al. Statistical convergence of kernel canonical correlation analysis. J. Mach. Learn. Res. (2007)
  • Hofmann, T., et al. Kernel methods in machine learning. Ann. Statist. (2008)
  • Huang, J.Z., et al. Varying coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika (2002)
  • Klopp, O., et al. Sparse high dimensional varying coefficient model: non-asymptotic minimax study. Ann. Statist. (2015)
  • Koller, D., et al. Probabilistic Graphical Models: Principles and Techniques (2009)
  • Koltchinskii, V., et al. Sparsity in multiple kernel learning. Ann. Statist. (2010)
  • Ledoux, M.
  • Ledoux, M., et al. Probability in Banach Spaces: Isoperimetry and Processes (2011)