1 Introduction

Longitudinal and functional data are frequently encountered in biomedicine, epidemiology, and other fields of the natural and social sciences; see Hand and Crowder (1996), Ramsay and Silverman (2005) and Ferraty (2011) for abundant interesting examples. In recent years, various functional regression models have been proposed to better describe and analyze such complicated data, since they allow the responses, the regressors, or both to be curves. Depending on the concrete forms of the responses and regressors, functional regression models are usually divided into three categories: scalar-on-function regression (see, e.g., Cai and Hall 2006; Chen, Hall and Müller 2011; Sang et al. 2018), function-on-function regression (Yao et al. 2005a; He et al. 2010; Wang et al. 2018) and function-on-scalar regression (Wu et al. 2013; Zhang and Wang 2015). For more details on functional regression models and their applications, see the overviews in Morris (2015), Wang et al. (2016) and Reiss et al. (2017).

Motivated by the regression analysis of imaging (or functional) data, such as diffusion tensor imaging (DTI) and positron emission tomography (PET) (Zhu et al. 2007; Friston 2009), many nonparametric and semi-parametric regression models for functional responses have been proposed; see, for example, Luo et al. (2016) and Li et al. (2017). A functional varying-coefficient mixed effects model, which can model the dynamic impacts of the covariates of interest on the response, is one of the most powerful tools for addressing these scientific problems with imaging data. Let \(Y_{i}(t)\) be the functional response variable for subject i, \(i=1,\ldots ,n\), and \({{\mathbf {x}}}_{i}=(x_{i1},\ldots ,x_{ip})^{T}\) be the associated p-dimensional covariate vector; then the functional varying-coefficient mixed effects model takes the form

$$\begin{aligned} Y_{i}(t)=\mathbf {x}_{i}^{T}{{\varvec{\alpha }}}(t)+\eta _{i}(t)+\varepsilon _{i}(t), \end{aligned}$$
(1)

where \({{\varvec{\alpha }}}(t)=(\alpha _{1}(t),\ldots ,\alpha _{p}(t))^{T}\) is the p-dimensional unknown but smooth coefficient function vector, \(\varepsilon _{i}(t)\) are measurement errors, and \(\eta _{i}(t)\) is the random effect function that characterizes individual curve variation (subject-effect) from \(\mathbf {x}_{i}^{T}{{\varvec{\alpha }}}(t)\). Without loss of generality, throughout this paper we assume that \(t\in \mathcal {T}=[0,1]\). Moreover, {\(\eta _{i}(t): t\in \mathcal {T}\)} is a stochastic process indexed by \(t\in \mathcal {T}\) that captures the within-curve dependence. We assume that \(\eta _{i}(t)\) are independent copies of \(\eta (t)\sim \mathrm {SP}(0, \Sigma )\), where \(\mathrm {SP}(\mu , \Sigma )\) denotes a stochastic process with mean function \(\mu (t)\) and covariance function \(\Sigma (s,t)\).

Model (1), as a classical function-on-scalar regression model, has been widely applied in functional data analysis and extensively studied in the existing literature. For instance, for model (1) without the random effect term, Wu and Chiang (2000) proposed two kernel estimators based on componentwise local least-squares criteria to estimate the varying-coefficient functions, and Chiang et al. (2001) employed a componentwise smoothing spline method to estimate the time-varying coefficients and studied the corresponding asymptotic properties. Zhang and Chen (2007) adopted the idea of “smoothing first, then estimation” to construct local polynomial kernel reconstruction-based estimators of the mean, covariance and error variance functions for model (1) and derived their asymptotics. Zhu et al. (2012) extended model (1) to a multivariate case and systematically studied the theoretical properties of the resulting estimators obtained by a local linear smoother. Zhu et al. (2014) further proposed a spatially varying coefficient model (SVCM) for neuroimaging studies and investigated the asymptotic properties of the parameter estimators. For more studies on varying-coefficient models and mixed effects models for longitudinal and functional data, readers may refer to Fan and Zhang (2008) and Chen and Wang (2011) for comprehensive discussions.

However, the aforementioned estimation approaches for model (1) are mainly based on kernel smoothing and smoothing splines, both of which ignore the within-subject correlation in the estimation procedure. For longitudinal and functional data, it is well known that the observations on the grid of each subject are highly correlated; as a result, these existing methods might suffer a loss of efficiency. Additionally, the kernel smoothing method uses only observations in the neighborhood of the target point for estimation. In longitudinal studies these neighborhoods usually contain only a few observations, and the within-subject correlation may cause observations outside the neighborhood to carry significant information. Consequently, the local kernel method might yield inefficient estimators, and new techniques that improve on the conventional local kernel method are highly desirable in longitudinal and functional data analysis. Chen et al. (2010) proposed a global partial likelihood method for the nonparametric proportional hazards model; the main idea of their global approach is to use all observations, rather than only local observations near the target point, to improve efficiency. Zhou et al. (2018) combined this global smoothing idea with the full quasi-likelihood method to estimate the mean and covariance functions for sparse functional data. Motivated by the idea of global smoothing, we present an efficient procedure for estimating the varying-coefficient functions in model (1) for longitudinal and sparse functional response data. The procedure combines the generalized least squares method with a modified local kernel smoothing criterion that uses the observations on the whole grid of each subject, and a simple iterative algorithm is proposed to compute the estimator.

The proposed method overcomes the potential drawbacks of existing local kernel methods and improves the efficiency of the resulting estimators. Specifically, the new method has several advantages: (i) the estimation procedure builds on the global smoothing idea, so it not only retains the advantage of the local linear smoother of using observations near the target point, but also exploits the information carried by observations outside the neighborhood, which the local kernel smoothing principle discards; (ii) the within-subject covariance structure is incorporated into the estimation procedure to improve the estimators; and (iii) the iterative procedure is easy to implement and can therefore be applied to a variety of practical problems. We derive both uniform consistency and asymptotic normality for the estimated varying-coefficient functions. Simulation studies comparing our method with the local linear method show that the proposed method has a smaller root mean squared error than its competitor, which further indicates that our method can more accurately characterize the dynamic relationship between a set of covariates and the functional response variable in applications. Finally, a real data example is presented to illustrate the newly proposed approach.

The rest of this article is organized as follows. Section 2 introduces the estimation procedure. The theoretical properties including consistency and asymptotic normality of the estimators are established in Sect. 3. Section 4 presents the simulation studies which are performed to evaluate the performance of the method. Section 5 is devoted to applying this method to analyze the white matter tract dataset obtained from Alzheimer’s Disease Neuroimaging Initiative (ADNI). Section 6 concludes the paper and discusses future work. Technical proofs are given in the “Appendix”.

2 Methods

In this paper, we focus on sparse data situations. Suppose that we observe the random functions \(Y_{i}\) at discrete points \(t_{ij}\), that is,

$$\begin{aligned} Y_{ij}=\mathbf {x}_{i}^{T}{{\varvec{\alpha }}}(t_{ij})+\eta _{i}(t_{ij}) +\varepsilon _{ij},~~j=1,\ldots ,m_{i}, ~i=1,\ldots ,n, \end{aligned}$$
(2)

where the observation locations \(t_{ij}\) are independent and identically distributed random variables with density function \(f(\cdot )\). Throughout this paper, we assume that \(\mathbf {x}_{i}\), \(t_{ij}\), \(\eta (\cdot )\) and \(\varepsilon (\cdot )\) are mutually independent and that the noise terms \(\varepsilon _{ij}\) are independent and identically distributed random errors with mean zero and variance \(\sigma ^{2}\). Suppose that \(2\le m_{i}\le M\) for some finite constant M and all \(i=1,\ldots ,n\). Our estimation procedure contains two key steps: \((\mathrm{I})\) compute the initial estimate \(\tilde{{\varvec{\alpha }}}(t)\) of \({{\varvec{\alpha }}}(t)\) and construct a nonparametric estimate of the covariance function \(\Sigma (s,t)\); \((\mathrm{II})\) use the estimated covariance function and the initial estimate \(\tilde{{\varvec{\alpha }}}(t)\) to develop a refined estimate of \({{\varvec{\alpha }}}(t)\), denoted by \(\widehat{{\varvec{\alpha }}}(t)\).

2.1 Initial estimation

We adopt the local linear scatter-plot smoother (Fan and Gijbels 1996) to estimate the varying-coefficient function \({{\varvec{\alpha }}}(t)\) and the covariance function \(\Sigma (s,t)\). For any T in a small neighborhood of t, a Taylor expansion gives \(\alpha _{k}(T)\approx \alpha _{k}(t)+\dot{\alpha }_{k}(t)(T-t)\equiv a_{k}+b_{k}(T-t)\) for \(k=1,\ldots ,p\), where \(\dot{\alpha }_{k}(t)=d{\alpha }_{k}(t)/dt\). Denote \({\varvec{\mathcal {A}}}_{0}=(a_{1},\ldots ,a_{p})^{T}\) and \({\varvec{\mathcal {A}}}=(a_{1},\ldots ,a_{p},b_{1},\ldots ,b_{p})^{T}\). Let \(K(\cdot )\) be a kernel function and \(K_{h}(\cdot )=h^{-1}K(\cdot /h)\) be a scaled kernel with bandwidth h. Then the local linear estimate of the regression coefficients \({{\varvec{\alpha }}}(t)\) is simply \(\tilde{{\varvec{\alpha }}}(t)=\widehat{{\varvec{\mathcal {A}}}}_{0}\), where

$$\begin{aligned} \widehat{{\varvec{\mathcal {A}}}}=\underset{{\varvec{\mathcal {A}}}}{\mathrm{arg ~min}}\left\{ \sum _{i=1}^{n}\sum _{j=1}^{m_i}\Big [Y_{ij}-\sum _{k=1}^{p}\left[ a_{k}+b_{k}(t_{ij}-t)\right] x_{ik}\Big ]^{2}K_{h_{\alpha }}(t_{ij}-t)\right\} . \end{aligned}$$
(3)
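Criterion (3) is a weighted least squares problem with a closed-form solution at each target point t. The following is a minimal NumPy sketch (function names and data layout are ours, not from the paper), using the Epanechnikov kernel adopted in the simulations:

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel, supported on [-1, 1]
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def local_linear_alpha(Y, X, T, t, h):
    """Local linear estimate of the varying coefficients alpha(t) at a
    single point t, from criterion (3).

    Y : list of arrays, Y[i] has shape (m_i,)  -- responses per subject
    X : array (n, p)                           -- scalar covariates
    T : list of arrays, T[i] has shape (m_i,)  -- observation times
    """
    rows, resp, wts = [], [], []
    for i in range(len(Y)):
        for j in range(len(Y[i])):
            d = T[i][j] - t
            w = epanechnikov(d / h) / h          # K_h(t_ij - t)
            if w > 0:
                # design row: [x_i, x_i * (t_ij - t)] for (a_k, b_k)
                rows.append(np.concatenate([X[i], X[i] * d]))
                resp.append(Y[i][j])
                wts.append(w)
    Z = np.asarray(rows); y = np.asarray(resp); w = np.sqrt(np.asarray(wts))
    coef, *_ = np.linalg.lstsq(Z * w[:, None], y * w, rcond=None)
    p = X.shape[1]
    return coef[:p]                              # tilde alpha(t) = hat A_0
```

Evaluating this function on a grid of t values yields the initial estimate of the whole coefficient curve.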

Let \(C_{ijl}=(Y_{ij}-\mathbf {x}_{i}^{T}\tilde{{\varvec{\alpha }}}(t_{ij})) (Y_{il}-\mathbf {x}_{i}^{T}\tilde{{\varvec{\alpha }}}(t_{il}))\) be the “raw” covariances, where \(\tilde{{\varvec{\alpha }}}(\cdot )\) is the estimated varying-coefficient function obtained from (3). Note that \(E(C_{ijl}|\mathrm{\mathbf { x}}_{i},t_{ij},t_{il})\approx \Sigma (t_{ij},t_{il})+\sigma ^2I(t_{ij}=t_{il})\), where \(I(\cdot )\) is an indicator function. Therefore, the local linear surface smoother for the covariance function \(\Sigma (s,t)\) can be obtained from the raw covariances with their diagonal elements removed, which yields \(\widehat{\Sigma }(s,t)=\widehat{\beta }_{0}\), where \(\widehat{{\varvec{\beta }}}=(\widehat{\beta }_{0},\widehat{\beta }_{1},\widehat{\beta }_{2})^{T}\) and

$$\begin{aligned} \widehat{{\varvec{{\beta }}}}=&\underset{{\varvec{{\beta }}}}{\mathrm{arg ~min}}\Bigg \{\sum _{i=1}^{n}\sum _{1\le j\ne l\le m_{i}} K_{h_{C}}(t_{ij}-s)K_{h_{C}}(t_{il}-t)\nonumber \\&\times [C_{ijl}-\beta _{0}-\beta _{1}(t_{ij}-s)-\beta _{2}(t_{il}-t)]^{2}\Bigg \}. \end{aligned}$$
(4)
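The surface criterion (4) is again a weighted least squares problem, now with a bivariate product kernel and the diagonal pairs excluded. A minimal sketch (our naming; residuals from the initial fit are taken as input):

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def smooth_covariance(resid, T, s, t, h):
    """Local linear surface estimate hat Sigma(s, t) from criterion (4).

    resid : list of arrays, resid[i][j] = Y_ij - x_i^T tilde alpha(t_ij)
    T     : list of arrays of observation times
    Only off-diagonal raw covariances C_ijl = resid[i][j] * resid[i][l]
    (j != l) are used, so the error variance on the diagonal is excluded.
    """
    rows, resp, wts = [], [], []
    for i in range(len(T)):
        m = len(T[i])
        for j in range(m):
            for l in range(m):
                if j == l:
                    continue
                w = (epanechnikov((T[i][j] - s) / h) *
                     epanechnikov((T[i][l] - t) / h)) / h**2
                if w > 0:
                    rows.append([1.0, T[i][j] - s, T[i][l] - t])
                    resp.append(resid[i][j] * resid[i][l])
                    wts.append(w)
    Z = np.asarray(rows); y = np.asarray(resp); w = np.sqrt(np.asarray(wts))
    beta, *_ = np.linalg.lstsq(Z * w[:, None], y * w, rcond=None)
    return beta[0]                               # hat Sigma(s, t) = hat beta_0
```

Evaluating this on a two-dimensional grid of (s, t) values gives the estimated covariance surface needed later for \(\widehat{{\varvec{V}}}_{i}\).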

In order to estimate \(\sigma ^{2}\), we first estimate the variance \(R(t)=\Sigma (t,t)+\sigma ^{2}\) of Y(t). Let \(\widehat{R}(t)\) be the local linear smoother of R(t), then \(\widehat{R}(t)=\widehat{\beta }_{0}\), where \(\widehat{{\varvec{\beta }}}=(\widehat{\beta }_{0},\widehat{\beta }_{1})^{T}\) and

$$\begin{aligned} \widehat{{\varvec{{\beta }}}}=\underset{{\varvec{{\beta }}}}{\mathrm{arg ~min}}\left\{ \sum _{i=1}^{n}\sum _{j=1}^{m_i} [C_{ijj}-\beta _{0}-\beta _{1}(t_{ij}-t)]^{2}K_{h_{V}}(t_{ij}-t)\right\} . \end{aligned}$$
(5)

The estimate of the error variance \(\sigma ^{2}\) is then

$$\begin{aligned} \widehat{\sigma }^{2}=\frac{1}{|\mathcal {T}|}\int _{\mathcal {T}} \{\widehat{R}(t)-\widehat{\Sigma }(t,t)\} \mathrm {d}t, \end{aligned}$$
(6)

where the notation \(|\mathcal {T}|\) denotes the length of the interval \(\mathcal {T}\). For a more detailed account of the estimation of mean and covariance functions for longitudinal and sparse functional data, one can refer to Yao et al. (2005b) and Li and Hsing (2010).
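Steps (5) and (6) can be sketched as follows: a one-dimensional local linear smoother of the diagonal raw covariances gives \(\widehat{R}(t)\), and the integral in (6) is approximated by an average over an equally spaced grid (our sketch, with illustrative function names):

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def smooth_variance(resid, T, t, h):
    """Local linear estimate hat R(t) of Sigma(t,t) + sigma^2, criterion (5)."""
    rows, resp, wts = [], [], []
    for i in range(len(T)):
        for j in range(len(T[i])):
            d = T[i][j] - t
            w = epanechnikov(d / h) / h
            if w > 0:
                rows.append([1.0, d])
                resp.append(resid[i][j] ** 2)    # diagonal raw covariance C_ijj
                wts.append(w)
    Z = np.asarray(rows); y = np.asarray(resp); w = np.sqrt(np.asarray(wts))
    beta, *_ = np.linalg.lstsq(Z * w[:, None], y * w, rcond=None)
    return beta[0]

def error_variance(R_hat, Sigma_diag):
    """hat sigma^2 from (6), with the integral over T = [0,1] approximated
    by an average of hat R(t) - hat Sigma(t,t) over an equally spaced grid."""
    return float(np.mean(R_hat - Sigma_diag))
```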

2.2 Refined estimation for varying-coefficient functions

Local kernel smoothing uses only observations near the target point for estimation. However, the within-subject correlation present in longitudinal and functional data may cause observations outside the neighborhood of the target point to carry relevant information, so the local linear method used in Sect. 2.1 for estimating the varying-coefficient functions may suffer a loss of efficiency. In this subsection, we combine the generalized least squares method with a modification of local linear smoothing to construct a refined estimator of \({{\varvec{\alpha }}}(\cdot )\). This new method not only uses all observations, rather than only those near the target point, but also incorporates the estimated covariance structure, which accounts for the within-subject correlation, into the estimation process to improve the resulting estimators.

Write \({{\varvec{Y}}}_{i}=(Y_{i1},\ldots ,Y_{i,m_{i}})^{T}\), \({{\varvec{t}}}_{i}=(t_{i1},\ldots ,t_{i,m_{i}})^{T}\), \({{\varvec{\alpha }}}({{\varvec{t}}}_{i})=(\alpha _{1}({{\varvec{t}}}_{i}),\ldots ,\alpha _{p}({{\varvec{t}}}_{i}))^{T}\) and \(\alpha _{j}({{\varvec{t}}}_{i})=(\alpha _{j}(t_{i1}),\ldots ,\alpha _{j}(t_{i,m_{i}}))^{T}\) for \(j=1,\ldots ,p\). Note that \(E({{\varvec{Y}}}_{i}|\mathbf {x}_{i},{{\varvec{t}}}_{i})={{\varvec{\alpha }}}^{T}({{\varvec{t}}}_{i})\mathbf {x}_{i}\) and \(\mathrm{cov}({{\varvec{Y}}}_{i}|\mathbf {x}_{i},{{\varvec{t}}}_{i})={{\varvec{\Sigma }}}_{i}+\sigma ^{2}I_{m_{i}}\), where \({\varvec{\Sigma }}_{i}=\{{\varvec{\Sigma }}(t_{ij},t_{il})\}^{m_{i}}_{j,l=1}\) is the covariance matrix of \(\eta _{i}({{\varvec{t}}}_{i})=(\eta _{i}(t_{i1}),\ldots ,\eta _{i}(t_{i,m_{i}}))^{T}\) and \(I_{m_{i}}\) is the identity matrix of order \(m_{i}\) for \(i=1,\ldots ,n\). For convenience, let \({\varvec{V}}_{i}={{\varvec{\Sigma }}}_{i}+\sigma ^{2}I_{m_{i}}\). Then the generalized least squares estimator of \({{\varvec{\alpha }}}(t)\) is obtained by minimizing the following objective function

$$\begin{aligned} {} Q({{\varvec{\alpha }}})=\sum _{i=1}^n \{{{\varvec{Y}}}_{i}-{{\varvec{\alpha }}}^{T}({{\varvec{t}}}_{i})\mathbf {x}_{i}\}^{T}{{\varvec{V}}}_{i}^{-1}\{{{\varvec{Y}}}_{i}-{{\varvec{\alpha }}}^{T}({{\varvec{t}}}_{i})\mathbf {x}_{i}\}. \end{aligned}$$
(7)

Firstly, we replace \({{\varvec{V}}}_{i}\) with its estimate. Since we assume no structure on the covariance function of Y(t) other than smoothness, the local linear estimates presented in Sect. 2.1 are recommended. Let \(\widehat{{\varvec{\Sigma }}}_{i}\) be the estimate of the covariance matrix \({\varvec{\Sigma }}_{i}\) and \(\widehat{\sigma }^{2}\) the estimated variance of the measurement errors, obtained from (4), (5) and (6). Then \({{\varvec{V}}}_{i}\) is substituted by \(\widehat{{\varvec{V}}}_{i}=\widehat{{{\varvec{\Sigma }}}}_{i}+\widehat{\sigma }^{2}I_{m_{i}}\).

Secondly, we adopt a strategy similar to that of Chen et al. (2010) and Zhou et al. (2018) to approximate \({{\varvec{\alpha }}}(t)\). Specifically, let \(B_{n}(t)\) be a neighborhood of t; then

$$\begin{aligned} {{\varvec{\alpha }}}(t_{ij})={{\varvec{\alpha }}}(t_{ij})I(t_{ij}\in B_{n}(t))+{{\varvec{\alpha }}}(t_{ij})\{1-I(t_{ij}\in B_{n}(t))\}. \end{aligned}$$
(8)

Based on the idea of the local linear smoother, we use the linear function \({{\varvec{\zeta }}}+\dot{{{\varvec{\zeta }}}}(t_{ij}-t)\) to approximate \({{\varvec{\alpha }}}(t_{ij})\) in the first term of (8) and replace \(I(t_{ij}\in B_{n}(t))\) with the kernel weight \(K\{(t_{ij}-t)/h\}=hK_{h}(t_{ij}-t)\). Then, for \(t_{ij}\) in a neighborhood of t, we have

$$\begin{aligned} {{\varvec{\alpha }}}(t_{ij})\approx \{{{\varvec{\zeta }}}+\dot{{{\varvec{\zeta }}}}(t_{ij}-t)\}hK_{h}(t_{ij}-t)+ {{\varvec{\alpha }}}(t_{ij})\{1-hK_{h}(t_{ij}-t)\}, \end{aligned}$$

where \({{\varvec{\zeta }}}=(\zeta _{1},\ldots ,\zeta _{p})^{T}\), \(\dot{{{\varvec{\zeta }}}}=(\dot{\zeta }_{1},\ldots ,\dot{\zeta }_{p})^{T}\) and \(K _{h}(\cdot )=h^{-1}K(\cdot /h)\) is a scaled kernel function with bandwidth h. This expansion is of great use in longitudinal and functional data analysis because it not only retains the advantage of the local linear smoother but also utilizes the information available in the data outside the neighborhood \(B_{n}(t)\). Substituting this approximation and the estimated covariance matrices into (7), we obtain the refined estimator of \({{\varvec{\vartheta }}}=({{\varvec{\zeta }}}^{T},\dot{{{\varvec{\zeta }}}}^{T})^{T}\) by solving the following equation

$$\begin{aligned} \frac{\partial Q({{\varvec{\vartheta }}};{{\varvec{\alpha }}})}{\partial {\varvec{\vartheta }}}=\sum _{i=1}^n{{\varvec{\Theta }}}_{i}^{T}{{\varvec{K}}}_{i,h} \widehat{{\varvec{V}}}_{i}^{-1}\left\{ {{\varvec{Y}}}_{i}-h{{\varvec{K}}}_{i,h}{{\varvec{\Theta }}}_{i} {{\varvec{\vartheta }}}-(I_{m_{i}}-h{{\varvec{K}}}_{i,h}){{\varvec{\alpha }}}^{T}({{\varvec{t}}}_{i}) \mathbf {x}_{i}\right\} =0, \end{aligned}$$
(9)

where \({{\varvec{K}}}_{i,h}=\mathrm{diag}(K_{h}(t_{i1}-t),\ldots ,K_{h}(t_{i,m_{i}}-t))\) and

$$\begin{aligned} {{\varvec{\Theta }}}_{i}=\begin{pmatrix} x_{i1}&{}\cdots &{} x_{ip} &{} x_{i1}(t_{i1}-t) &{}\cdots &{} x_{ip}(t_{i1}-t)\\ \vdots &{}\ddots &{}\vdots &{}\vdots &{}\ddots &{}\vdots \\ x_{i1}&{}\cdots &{} x_{ip} &{} x_{i1}(t_{i,m_{i}}-t) &{}\cdots &{} x_{ip}(t_{i,m_{i}}-t) \end{pmatrix}. \end{aligned}$$

Note that the refined estimator cannot be obtained directly by solving Eq. (9), as it still involves the unknown function vector \({{\varvec{\alpha }}}({{\varvec{t}}}_{i})\). We therefore present an iterative procedure in which \({{\varvec{\alpha }}}({{\varvec{t}}}_{i})\) is replaced by its previous estimate. Let \(t_{1}<t_{2}<\cdots <t_{M}\) be the distinct values of \(t_{ij}\) for \(i=1,\ldots ,n\) and \(j=1, \ldots ,m_{i}\). The iterative algorithm is described as follows.

  • Step 0 Take the estimate obtained from (3) as the initial value of \({{\varvec{\alpha }}}(t)\), denoted by \({{\varvec{\alpha }}}_{[0]}(t)=\tilde{{\varvec{\alpha }}}(t)\) for \(t=t_{1},\ldots ,t_{M}\).

  • Step 1 For each \(t=t_{1},\ldots ,t_{M}\), solve Eq. (9) with \({{\varvec{\alpha }}}(t)\) replaced by its previous estimate, that is,

    $$\begin{aligned} \sum _{i=1}^n {{\varvec{\Theta }}}^{T}_{i}{{\varvec{K}}}_{i,h}\widehat{{\varvec{V}}}_{i}^{-1}\{{{\varvec{Y}}}_{i} -h{{\varvec{K}}}_{i,h}{{\varvec{\Theta }}}_{i}{{\varvec{\vartheta }}}-(I_{m_{i}}-h{{\varvec{K}}}_{i,h}) {{\varvec{\alpha }}}^{T}_{[s-1]}({{\varvec{t}}}_{i}) \mathbf {x}_{i}\}=0, \end{aligned}$$
    (10)

    where \({{\varvec{\alpha }}}^{T}_{[s-1]}({{\varvec{t}}}_{i})\) is the \((s-1)\)th iterate of \({{\varvec{\alpha }}}^{T}({{\varvec{t}}}_{i})\). Equation (10) yields the estimator \(\widehat{{\varvec{\vartheta }}}(t)\), given by

    $$\begin{aligned} \widehat{{\varvec{\vartheta }}}(t) = \begin{pmatrix} \widehat{{\varvec{\zeta }}}(t) \\ {\widehat{\dot{{\varvec{\zeta }}}}}(t) \end{pmatrix} =&\left\{ \sum _{i=1}^n {{\varvec{\Theta }}}^{T}_{i}{{\varvec{K}}}_{i,h}\widehat{{\varvec{V}}}_{i}^{-1}h{{\varvec{K}}}_{i,h} {{\varvec{\Theta }}}_{i}\right\} ^{-1} \nonumber \\&\times \sum _{i=1}^n {{\varvec{\Theta }}}^{T}_{i}{{\varvec{K}}}_{i,h}\widehat{{\varvec{V}}}_{i}^{-1} \{{{\varvec{Y}}}_{i}-(I_{m_{i}}-h{{\varvec{K}}}_{i,h}){{\varvec{\alpha }}}^{T}_{[s-1]}({{\varvec{t}}}_{i}) \mathbf {x}_{i}\}. \end{aligned}$$
    (11)

    Hence, \({{\varvec{\alpha }}}_{[s]}(t)=\widehat{{\varvec{\zeta }}}(t)\).

  • Step 2 Repeat Step 1 until convergence.
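One pass of Step 1 over the grid can be sketched as follows, assuming (as in the algorithm) that all observation times lie on the common grid \(t_{1},\ldots ,t_{M}\) so that \({{\varvec{\alpha }}}_{[s-1]}(t_{ij})\) can be looked up by index. Names and data layout are illustrative, not from the paper:

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def refine_alpha_step(Y, X, T, V_inv, alpha_prev, grid, h):
    """One iteration of Step 1: update alpha on `grid` via closed form (11).

    Y, T       : per-subject arrays;  X : (n, p) covariates
    V_inv      : list of (m_i, m_i) inverses of hat V_i
    alpha_prev : (len(grid), p) previous iterate alpha_[s-1] on `grid`
    """
    n, p = X.shape
    alpha_new = np.empty_like(alpha_prev)
    for g, t in enumerate(grid):
        A = np.zeros((2 * p, 2 * p))
        b = np.zeros(2 * p)
        for i in range(n):
            d = T[i] - t
            K = np.diag(epanechnikov(d / h) / h)            # K_{i,h}
            # Theta_i: [x_i, x_i * (t_ij - t)] row by row
            Theta = np.hstack([np.tile(X[i], (len(d), 1)),
                               np.tile(X[i], (len(d), 1)) * d[:, None]])
            idx = np.searchsorted(grid, T[i])               # times on grid
            fit_prev = alpha_prev[idx] @ X[i]               # alpha_[s-1]^T(t_i) x_i
            W = Theta.T @ K @ V_inv[i]
            A += W @ (h * K) @ Theta
            b += W @ (Y[i] - (np.eye(len(d)) - h * K) @ fit_prev)
        theta = np.linalg.solve(A, b)
        alpha_new[g] = theta[:p]                            # hat zeta(t)
    return alpha_new
```

Iterating this update until the change between successive iterates is negligible implements Steps 1 and 2.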

The selection of the bandwidth h for the estimation procedure presented above is achieved by the leave-one-curve-out cross-validation (Rice and Silverman 1991). Specifically, let \(\widehat{{\varvec{\alpha }}}^{(-i)}(\cdot )\) be the estimated varying-coefficient function of \({{\varvec{\alpha }}}(\cdot )\) after removing the ith subject from the data. Then the optimal bandwidth h can be chosen by minimizing the cross-validation score based on the squared prediction error given by

$$\begin{aligned} PE(h)=\frac{1}{n}\sum _{i=1}^n\frac{1}{m_{i}}\sum _{j=1}^{m_{i}}\{Y_{i}(t_{ij}) -\mathbf {x}_{i}^{T}\widehat{{\varvec{\alpha }}}^{(-i)}(t_{ij})\}^2. \end{aligned}$$
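The leave-one-curve-out score can be sketched generically: any of the estimators above can be wrapped as the `fit_fn` argument below, which is our illustrative interface, not the authors' code:

```python
import numpy as np

def cv_bandwidth(Y, X, T, fit_fn, bandwidths):
    """Leave-one-curve-out cross-validation for the bandwidth h.

    fit_fn(Y, X, T, h) must return a callable a_hat(t) giving the
    estimated coefficient vector at time t.
    Returns the bandwidth minimizing the prediction-error score PE(h).
    """
    n = len(Y)
    scores = []
    for h in bandwidths:
        pe = 0.0
        for i in range(n):
            # refit with the i-th curve removed
            keep = [k for k in range(n) if k != i]
            a_hat = fit_fn([Y[k] for k in keep],
                           X[keep], [T[k] for k in keep], h)
            pred = np.array([X[i] @ a_hat(t) for t in T[i]])
            pe += np.mean((Y[i] - pred) ** 2)
        scores.append(pe / n)
    return bandwidths[int(np.argmin(scores))]
```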

Remark 1

The proposed iterative estimation procedure can be readily implemented since each step has a closed form. Compared to the local linear method used in Zhu et al. (2012), the new estimation approach may be more efficient because it takes into account both the within-subject correlation structure and the observations on the whole grid of each subject, both of which are important in longitudinal and functional data analysis. It is not hard to see that the estimation in (3) is, in essence, a local marginal estimation that assumes the within-subject covariance matrices are identity matrices. This leads to a working-independence estimator of \({{\varvec{\alpha }}(t)}\) that ignores the correlation among the observations on the grid of each subject and thus incurs a loss of efficiency.

Remark 2

Substituting the estimate of \({{\varvec{\alpha }}(t)}\) into model (1), we can also estimate each random effect function \(\eta _{i}(t)\). Many existing nonparametric methods, such as local linear regression, have been developed to directly estimate the individual functions \(\eta _{i}(t)\), \(i=1,\ldots ,n\) (see, e.g., Zhu et al. 2012; Li et al. 2017). However, these nonparametric methods may not be suitable for our sparse data situations, where the number of observation points per subject is small. Functional principal component analysis (FPCA), a key technique for characterizing unknown functions in functional data analysis, is another feasible method for estimating \(\eta _{i}(t)\). We recommend the version of FPCA developed by Yao et al. (2005b), which they refer to as principal components analysis through conditional expectation (PACE) for longitudinal data, to estimate each random effect function \(\eta _{i}(t)\). The PACE approach is appealing since it is implemented mainly through the estimated population covariance function of \(\eta _{i}(t)\), which utilizes the whole sample and can easily be obtained with the method presented in Sect. 2.1. Comprehensive accounts of FPCA for longitudinal or sparse functional data are given in Yao et al. (2005b) and Hall et al. (2006).

3 Asymptotic properties

In this section, we study the asymptotic properties of our estimators. Some regularity conditions needed for the theoretical developments are stated in “Appendix A.1”. All these conditions are mild, and most of them are also adopted by Chen et al. (2010) and Zhou et al. (2018). The detailed proofs of the following theorems are given in “Appendix A.2–A.3”.

To start with, we introduce some notation used in the theorems. Let \(\nu _{r}=\int u^{r}K(u)du\) and \(\upsilon _{r}=\int u^{r}K^{2}(u)du\) for notational simplicity. For a vector of functions \({{\varvec{\alpha }}}(t)\) of t, denote \(\dot{{{\varvec{\alpha }}}}(t)=d{{\varvec{\alpha }}}(t)/dt\) and \(\ddot{{{\varvec{\alpha }}}}(t)=d^{2}{{\varvec{\alpha }}}(t)/dt^{2}\), the componentwise derivatives. Let \(\mathcal {V}_{i,jk}\) be the (j,k)th element of \({{\varvec{V}}}_{i}^{-1}=({{\varvec{\Sigma }}}_{i}+\sigma ^{2} I_{m_{i}})^{-1}\). Denote \(\gamma _{1}(s)=\lim \limits _{n\rightarrow \infty }n^{-1}\sum _{i=1}^n\sum _{j=1}^{m_{i}}E[\mathcal {V}_{i,jj}|t_{ij}=s]\), \(\gamma _{2}(s_{1},s_{2})=\lim \limits _{n\rightarrow \infty }n^{-1}\sum _{i=1}^n\sum _{j\ne k}^{m_{i}}E[\mathcal {V}_{i,jk}|t_{ij}=s_{1},t_{ik}=s_{2}]\) and \(\Omega _{X}=\text {E}[\mathbf{x }\mathbf{x }^{T}]\). Let \(\mathcal {I}\) be the identity operator, and let \(\mathcal {A}\) be the linear operator satisfying \(\mathcal {A}(H)(t)=\int _{z}\{\gamma _{2}(t,z)/\gamma _{1}(t)\}H(z)f(z)dz\) for any function H.

Theorem 1

Under the regularity Conditions (C1)–(C8) in “Appendix A.1”, when \(n\rightarrow \infty \), we have

$$\begin{aligned} \sup _{t\in [0,1]}|\widehat{\alpha }_{j}(t)-\alpha _{j}(t)|\rightarrow 0,~~j=1,\ldots ,p, \end{aligned}$$

in probability.

Theorem 2

Suppose that conditions of Theorem 1 hold. Then, for any \(t\in (0, 1)\), we have

$$\begin{aligned} (\mathcal {I}+\mathcal {A})(\widehat{{\varvec{\alpha }}}-{{\varvec{\alpha }}}) (t)=(nh)^{-1/2}{{\varvec{\Gamma }}}(t)^{1/2}{{\varvec{\varphi }}}+\frac{1}{2} \upsilon _{2}h^{2}\ddot{{\varvec{\alpha }}}(t)+o_{p}\{h^2+(nh)^{-1/2}\}, \end{aligned}$$

where \({{\varvec{\Gamma }}}(t)=\upsilon _{0}[\Omega _{X}f(t)\gamma _{1}(t)]^{-1}\) and \({{\varvec{\varphi }}}\) is a random vector following the standard normal distribution.

Theorem 2 implies the following Corollary 3.

Corollary 3

Under the Conditions (C1)–(C8) in the “Appendix A.1”, we have

$$\begin{aligned} \sqrt{nh}\left\{ (\widehat{{\varvec{\alpha }}}-{{\varvec{\alpha }}})(t)-\frac{1}{2} (\mathcal {I}+\mathcal {A})^{-1}(\ddot{{\varvec{\alpha }}})(t)\upsilon _{2}h^{2}\right\} \overset{\mathcal {L}}{\longrightarrow } N(0,{{\varvec{\Xi }}}(t)) \end{aligned}$$

for any \(t\in (0, 1)\), where \({{\varvec{\Xi }}}(t)=\{(\mathcal {I}+\mathcal {A})^{-1}({{\varvec{\Gamma }}}^{1/2})(t)\} \{(\mathcal {I}+\mathcal {A})^{-1}({{\varvec{\Gamma }}}^{1/2})(t)\}^{T}\).

Remark 3

Theorem 1 shows that the proposed estimators \(\widehat{\alpha }_{j}(\cdot )\), \(j=1,\ldots ,p\), are uniformly consistent. Theorem 2 shows that the asymptotic bias of \((\mathcal {I}+\mathcal {A})(\widehat{{\varvec{\alpha }}}-{{\varvec{\alpha }}})(t)\) is of order \(O(h^2)\) and the asymptotic variance is of order \(O((nh)^{-1})\); as a consequence, the theoretically optimal bandwidth is of order \(n^{-1/5}\). These results are similar to those of Zhou et al. (2018); indeed, the asymptotic properties above can be viewed as generalizing the results in Zhou et al. (2018) from mean function estimation to varying-coefficient function estimation, and they therefore apply to a wider range of practical problems. In particular, they provide theoretical guidance for analyzing the dynamic relationship between a set of covariates and the functional response variable, rather than only exploring the features of functional variables.

4 Simulation studies

In this section, we perform simulation studies to illustrate the finite sample properties of the proposed estimation method. Since the proposed procedure is aimed at estimating the varying-coefficient functions, we use the accuracy of the varying-coefficient function estimators to demonstrate how well the procedure works. The performance is evaluated via the bias, standard deviation (SD) and root mean squared error (RMSE) at given grid points, as also used by Zhou et al. (2018). Specifically, the bias, SD and RMSE of an estimator \(\widehat{\alpha }(\cdot )\) are respectively defined as

$$\begin{aligned} \mathrm{Bias}=\left[ \frac{1}{n_{g}}\sum _{j=1}^{n_{g}}\big \{E\widehat{\alpha }(t_{j}) -\alpha (t_{j})\big \}^{2}\right] ^{1/2},~~~~\mathrm{SD}=\left[ \frac{1}{n_{g}} \sum _{j=1}^{n_{g}}E\big \{\widehat{\alpha }(t_{j})-E\widehat{\alpha }(t_{j})\big \}^{2}\right] ^{1/2} \end{aligned}$$

and \(\mathrm{RMSE}=(\mathrm{Bias}^2+\mathrm{SD}^2)^{1/2}\), where \(\widehat{\alpha }(\cdot )\) is an estimator of \(\alpha (\cdot )\), \(E\widehat{\alpha }(t_{j})\) is estimated by the sample mean over the N simulated data sets, and \(t_{j}\) \((j=1,\ldots ,n_{g})\) are the grid points at which the function \(\widehat{\alpha }(\cdot )\) is evaluated. In addition, comparisons are made between our method and the local linear method employed in Zhu et al. (2012) (Local hereinafter). In the following simulated examples, the Epanechnikov kernel is used for the nonparametric regression. We set the number of grid points to \(n_{g}=100\) and compute the results based on \(N=500\) replications.
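Given the Monte Carlo estimates on a grid, these three criteria can be computed as follows (a small sketch; the input layout is ours):

```python
import numpy as np

def bias_sd_rmse(estimates, truth):
    """Monte Carlo Bias, SD and RMSE of an estimated curve on a grid.

    estimates : (N, n_g) array, one row per simulated data set
    truth     : (n_g,) true values alpha(t_j) on the grid
    """
    mean_est = estimates.mean(axis=0)            # estimates E alpha_hat(t_j)
    bias = np.sqrt(np.mean((mean_est - truth) ** 2))
    sd = np.sqrt(np.mean(((estimates - mean_est) ** 2).mean(axis=0)))
    rmse = np.sqrt(bias ** 2 + sd ** 2)
    return bias, sd, rmse
```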

Example 4

Generate data from the following model

$$\begin{aligned} Y_{ij}=\mathbf {x}_{i}^{T}{{\varvec{\alpha }}}(t_{ij})+\eta _{i}(t_{ij})+\varepsilon _{ij}, \end{aligned}$$

where \(t_{ij}\sim U[0,1]\), \(\varepsilon _{ij}\sim N(0,\sigma ^2)\) and \(\mathbf {x}_{i}=(1,x_{i1},x_{i2})^{T}\) for \(i=1,\ldots ,n\), \(j=1,\ldots ,m_{i}\). Moreover, \((x_{i1},x_{i2})^{T}\sim N((0,0)^{T},\mathrm{diag}(0.9,0.9)+0.1(1,1)^{\otimes 2})\), and \(\eta _{i}(t_{ij})=\xi _{i1}\psi _{1}(t_{ij})+\xi _{i2}\psi _{2}(t_{ij})\), where \(\xi _{ij}\sim N(0,\lambda _{j}^2)\) for \(j=1,2\). Furthermore, \((x_{i1}, x_{i2})\), \(t_{ij}\), \(\varepsilon _{ij}\), \(\xi _{i1}\) and \(\xi _{i2}\) are independent random variables. We set \((\sigma , \lambda _{1}, \lambda _{2})= (0.3, 1, 0.5)\) and take the varying-coefficient functions and eigenfunctions as follows:

$$\begin{aligned} \alpha _{1}(t)= & {} t^2,~~\alpha _{2}(t)=-3(t-0.5)^2+1,~~\alpha _{3}(t)=4t(t-1),\\ \psi _{1}(t)= & {} \sqrt{2}\sin (2\pi t), ~~\psi _{2}(t)=\sqrt{2}\cos (2\pi t). \end{aligned}$$
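The data-generating mechanism of Example 4 can be sketched as follows (function name and output layout are ours):

```python
import numpy as np

def generate_example4(n, m, seed=0):
    """Simulate one data set from the model of Example 4."""
    rng = np.random.default_rng(seed)
    sigma, lam1, lam2 = 0.3, 1.0, 0.5

    def alpha(t):                                # varying coefficients
        return np.column_stack([t**2, -3*(t - 0.5)**2 + 1, 4*t*(t - 1)])

    cov = np.diag([0.9, 0.9]) + 0.1              # cov of (x_i1, x_i2)
    x12 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    X = np.hstack([np.ones((n, 1)), x12])        # x_i = (1, x_i1, x_i2)
    T, Y = [], []
    for i in range(n):
        t = np.sort(rng.uniform(size=m))         # t_ij ~ U[0,1]
        # random effect: xi_1 psi_1(t) + xi_2 psi_2(t)
        eta = (rng.normal(0, lam1) * np.sqrt(2) * np.sin(2*np.pi*t)
               + rng.normal(0, lam2) * np.sqrt(2) * np.cos(2*np.pi*t))
        Y.append(alpha(t) @ X[i] + eta + rng.normal(0, sigma, size=m))
        T.append(t)
    return Y, X, T
```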

We first conducted simulation studies under the following two regular time-point settings.

  • Setting 1: \(m_{i}=5\) with various sample size: \(n=200\) and \(n=400\).

  • Setting 2: \(n=150\) with various time points: \(m_{i}=7\) and \(m_{i}=13\).

In Setting 1, two sample sizes, \(n=200\) and \(n=400\), are considered with a fixed number of time points \(m_{i}=5\). In Setting 2, the sample size \(n=150\) is fixed while the number of time points varies (\(m_{i}=7\) and \(m_{i}=13\)); this keeps the total number of observations comparable to Setting 1 by decreasing the number of subjects and increasing the number of observations per individual. Tables 1 and 2 report the numerical results for the proposed method and the Local method under Setting 1 and Setting 2, respectively.

Table 1 The simulation results for Setting 1 in Example 4. Bias, SD and RMSE of \(\widehat{\alpha }_{1}(\cdot )\), \(\widehat{\alpha }_{2}(\cdot )\) and \(\widehat{\alpha }_{3}(\cdot )\) using the proposed method and the Local method
Table 2 The simulation results for Setting 2 in Example 4. Bias, SD and RMSE of \(\widehat{\alpha }_{1}(\cdot )\), \(\widehat{\alpha }_{2}(\cdot )\) and \(\widehat{\alpha }_{3}(\cdot )\) using the proposed method and the Local method

Example 5

To illustrate that the proposed procedure remains applicable when the observation times for each trajectory are randomly missing, we replicated Example 4 with the irregular time-point design used in Fan, Huang and Li (2007). Specifically, the observation times are generated randomly: each individual has a set of “scheduled” observation points, \(\{0,1,2,\ldots ,11\}\); each scheduled time, except time 0, has a \(20\%\) probability of being skipped; and a uniform [0, 1] random variable is added to each non-skipped scheduled time, so that the actual observation time is a random perturbation of a scheduled time. Table 3 presents the numerical results for this case with sample sizes \(n=200\) and 400.
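The irregular design above can be sketched as follows; the resulting times span roughly [0, 12) and would be rescaled or handled as appropriate for the estimation (our sketch):

```python
import numpy as np

def irregular_times(rng, skip_prob=0.2):
    """One subject's observation times under the Fan, Huang and Li (2007)
    design: scheduled times {0,...,11}, each time except 0 skipped with
    probability 20%, and every kept time jittered by a U[0,1) draw."""
    sched = np.arange(12)
    keep = np.concatenate([[True], rng.uniform(size=11) > skip_prob])
    return sched[keep] + rng.uniform(size=keep.sum())
```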

Table 3 The simulation results for Example 5. Bias, SD and RMSE of \(\widehat{\alpha }_{1}(\cdot )\), \(\widehat{\alpha }_{2}(\cdot )\) and \(\widehat{\alpha }_{3}(\cdot )\) using the proposed method and the Local method

Tables 1, 2 and 3 summarize the numerical performance of the proposed method and the Local method in terms of Bias, SD and RMSE. The RMSE of the proposed method is consistently smaller than that of the Local method for all three varying-coefficient estimators \(\widehat{\alpha }_{1}(t)\), \(\widehat{\alpha }_{2}(t)\) and \(\widehat{\alpha }_{3}(t)\), indicating that the proposed method outperforms the Local method. Moreover, the proposed method has smaller bias and variance than the Local method, mainly because it incorporates the correlation information and the modified local linear technique. The Bias, SD and RMSE values also decrease as the sample size increases, which corroborates the consistency and asymptotic normality established in Sect. 3.

Comparing Tables 1 and 2, the estimators obtained under Setting 1 of Example 4 have smaller RMSE than those obtained under Setting 2, even though the total numbers of observations in the two settings are close. This finding suggests that sparsity is beneficial to the performance of the proposed method: with sparse observations per curve, the estimators of the varying-coefficient functions improve as the number of subjects increases. For all simulation cases, we chose \(h_{0}=0.12\) as the bandwidth for the initial estimates of the varying-coefficient functions and for the estimates of the covariance functions; we found that the proposed method is not sensitive to these initial values. Furthermore, the optimal bandwidth for the proposed method is slightly larger than that for the Local method. This is likely because the correlation matrix incorporated into the estimation process acts as a weight in addition to the kernel matrix and thereby downweights part of the data, so a larger bandwidth is needed to enlarge the range of data included.

We also apply the PACE method to Example 5 to estimate \(\eta _{i}(t)\) and compare it with the local linear approach. The PACE method is implemented with the FPCA function in the R package fdapace (Dai et al. 2019), using the first four functional principal components to construct the estimators of \(\eta _{i}(t)\) \((i=1,\ldots ,n)\). The bandwidth for the local linear method is selected by generalized cross-validation (GCV). We evaluate the performance through the root average squared error (RASE) given by

$$\begin{aligned} \mathrm{RASE}=\left[ \frac{1}{n}\sum _{i=1}^{n}\frac{1}{n^{*}_{g}} \sum _{j=1}^{n^{*}_{g}}\big \{\widehat{\eta }_{i}(t^{*}_{ij})- \eta _{i}(t^{*}_{ij})\big \}^{2}\right] ^{1/2}, \end{aligned}$$

where \(t^{*}_{ij}\;(j=1,\ldots ,n^{*}_{g})\) are the grid points at which the function \(\widehat{\eta }_{i}(\cdot )\) is evaluated. Since the observation points for each subject are sparse, we set \(n^{*}_{g}=50\) when calculating the RASE. Table 4 presents the mean and standard deviation (SD) of the RASE, both computed over 200 simulation replications. Both the mean and SD of the RASE for the PACE method are markedly smaller than those of the local linear method, implying that the PACE method is superior for estimating \(\eta _{i}(t)\). Moreover, the mean RASE of the PACE method decreases as the sample size increases, while that of the local linear method remains almost unchanged. This is likely because the PACE estimators \(\widehat{\eta }_{i}(t)\) \((i=1,\ldots ,n)\) are mainly based on the estimated population covariance function of \(\eta _{i}(t)\), which pools information across the whole sample, whereas the local linear method estimates each \(\eta _{i}(t)\) directly using only the observations on its own curve; the resulting estimators therefore cannot be improved by increasing the number of subjects.
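Note that the double sum in the RASE definition is simply a root mean square over an \(n \times n^{*}_{g}\) array of evaluation errors, so it can be computed in a few lines; the function name below is ours.

```python
import numpy as np

def rase(eta_hat, eta_true):
    """Root average squared error between estimated and true subject
    trajectories, each an (n, n_g) array evaluated on a common grid of
    n_g points (the paper uses n_g = 50).  Averaging over both axes is
    exactly (1/n) sum_i (1/n_g) sum_j {eta_hat - eta_true}^2."""
    diff = np.asarray(eta_hat, dtype=float) - np.asarray(eta_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, if every estimated value is off by exactly one, the RASE equals one regardless of \(n\) and \(n^{*}_{g}\).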

Table 4 The mean and SD of RASE for estimators \(\widehat{\eta }_{i}(\cdot )\) using the PACE method and the local linear method with data generated from Example 5

To give a more visual display of how well the proposed estimation procedure works, we take Example 5 with sample size \(n=200\) as an example. Figure 1 exhibits the averaged estimated varying-coefficient functions based on 50 replications and the boxplot of the RASE of the varying-coefficient functions over 500 simulations. It can be seen from Fig. 1a–c that the estimated varying-coefficient curves (broken lines) obtained by the proposed method closely resemble the corresponding true curves (solid lines), reflecting the good performance of the proposed procedure. Figure 1d shows that the proposed method performs better than the Local method in terms of RASE.

Fig. 1
figure 1

The averaged estimated varying-coefficient functions in (a)–(c) over 50 simulations and the boxplots for the three varying-coefficient functions in (d) over 500 simulations under Example 5 with sample size \(n=200\). The solid lines are the true functions and the broken lines the estimated ones in (a)–(c)

5 Real data analysis

We apply model (2) and the newly proposed procedure to analyze a real diffusion tensor imaging (DTI) dataset with \(n=213\) subjects collected from the NIH Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. One goal of the NIH ADNI is to test whether genetic, structural and functional neuroimaging, and clinical data can be integrated to evaluate the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For more details on this dataset, see the publicly available ADNI database (http://adni.loni.usc.edu/). In diffusion data analysis, fractional anisotropy (FA), which quantifies the directional strength of the white matter tract structure at a particular location, is one of the most widely used measures and has been applied in statistical analyses of many imaging studies. During data processing, we calculate the FA curve for each subject using a Tract-Based Spatial Statistics (TBSS) pipeline analysis (Smith et al. 2006) and the FMRIB Software Library tool. For more details on this process, one can refer to Li et al. (2017).

In this study, 213 subjects aged 48.4–73.2 years are sampled to assess the impact on FA of several covariates commonly considered in diffusion data analyses: gender (123 male and 90 female), handedness (193 right-handed and 20 left-handed), education level (in years, ranging from 9 to 20, mean 15.91), Alzheimer’s disease (AD) status (\(19.6\%\)), mild cognitive impairment (MCI) status (\(55.1\%\)) and the Mini-Mental State Exam (MMSE) score. We evaluate the effects of these covariates on FA using model (2). Specifically, the response \(Y_{i}(t)\) is the FA curve, and the covariate vector \(\mathbf{x}_{i}\) includes an intercept term, a dummy variable indicating male gender, the age of the subject, a dummy variable indicating left-handedness, the education level, an indicator for AD status, an indicator for MCI status and the MMSE score. We randomly selected 15 observations from the 83 grid points of each FA curve, so that we can compare the results obtained from the sparse data with those from the complete data and further validate the new approach. We standardize all variables to have mean zero and variance one, and apply the procedure established in Sect. 2 to estimate the corresponding coefficients.

Fig. 2
figure 2

The estimated varying coefficients for intercept, sex, age, handedness, education, AD, MCI and MMSE and their \(95\%\) pointwise confidence intervals for the ADNI data. The dashed lines are the estimates obtained by our proposed estimation procedure with sparse data and the solid lines are estimated by the local linear method with complete data

Figure 2 shows the estimated varying-coefficient functions for the intercept, gender, age, handedness, education, AD status, MCI status and MMSE, together with the associated 95% pointwise confidence intervals, for both the sparse and the complete data. The pointwise confidence intervals were obtained from standard errors based on 200 bootstrap resamples. The dashed lines are the estimates obtained by the proposed method with sparse data and the solid lines are estimated by the local linear method with complete data. Figure 2 indicates that the proposed estimation method works well even though the data are sparse; it also reveals that age, handedness, education, AD and MCI have stronger effects on FA than gender and MMSE. The intercept function characterizes the nonlinear overall trend of FA. It can be seen from Fig. 2 that the coefficient curve for gender is positive and that for MMSE is close to the zero line, while the coefficients for the other five covariates of interest are negative at most of the grid points. These findings suggest that males tend to have higher FA values than females, that MMSE has no remarkable effect on FA, that AD and MCI patients tend to have lower FA values, and that being older, left-handed or more educated may lead to smaller FA values. These discoveries are consistent with the previous analysis in Li et al. (2017). Finally, we calculated the squared prediction error, \(SPE=\sum _{i=1}^n\sum _{j=1}^{m_{i}}\{Y_{i}(t_{ij})-\widehat{Y}_{i}^{*}(t_{ij})\}^2\), where \(\widehat{Y}_{i}^{*}\) is the trajectory for the ith subject predicted from the data with the ith subject left out, for both the proposed method and the local linear method. The SPE values for the two approaches are 18.34 and 18.42, respectively, which implies that the proposed method has a smaller prediction error than the local linear method for this dataset.
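The leave-one-subject-out SPE can be sketched generically as below. Here `fit_predict` is a hypothetical placeholder for whichever fitting routine is being evaluated (the proposed method or the local linear method); it is not part of the paper.

```python
import numpy as np

def squared_prediction_error(Y, T, fit_predict):
    """Leave-one-subject-out squared prediction error,
    SPE = sum_i sum_j {Y_i(t_ij) - Yhat*_i(t_ij)}^2, where the ith
    trajectory is predicted from the data with subject i left out.
    Y, T : lists of per-subject observation and time arrays.
    fit_predict(Y_train, T_train, t_new) : placeholder fitting routine
    returning predictions on the grid t_new."""
    spe = 0.0
    for i in range(len(Y)):
        Y_train = [y for k, y in enumerate(Y) if k != i]
        T_train = [t for k, t in enumerate(T) if k != i]
        y_pred = fit_predict(Y_train, T_train, T[i])
        spe += float(np.sum((np.asarray(Y[i]) - np.asarray(y_pred)) ** 2))
    return spe
```

Any estimator that maps a training sample and a new time grid to predicted responses can be plugged in for `fit_predict`, which makes the two methods directly comparable on the same held-out subjects.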

Figure 3 presents the estimated eigenvalues and eigenfunctions for the covariance function \(\Sigma (s,t)\) with sparse data. The top four nonzero eigenvalues are 0.0028, 0.00047, 0.0003 and 0.00016, respectively; together they explain about \(94\%\) of the total variability, and it can be seen from Fig. 3 that the remaining eigenvalues rapidly drop to zero. The first eigenfunction, whose dominant eigenvalue accounts for \(70\%\) of the total variation, has a simple structure. The other eigenfunctions also have quite simple structures, most of them roughly sinusoidal.
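The cumulative fraction of variance explained (FVE) shown in the right panel of Fig. 3 is simply the running sum of the estimated eigenvalues divided by their total; a minimal sketch (function name ours):

```python
import numpy as np

def cumulative_fve(eigenvalues):
    """Cumulative fraction of variance explained (FVE) by the leading
    eigenvalues of the estimated covariance function: the running sum of
    the eigenvalues divided by their grand total."""
    lam = np.asarray(eigenvalues, dtype=float)
    return np.cumsum(lam) / lam.sum()
```

For instance, `cumulative_fve([2.0, 1.0, 1.0])` returns `[0.5, 0.75, 1.0]`; applied to the full estimated eigenvalue sequence (not just the leading four values quoted above), its fourth entry is the reported \(94\%\).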

Fig. 3
figure 3

The estimated eigenvalues in decreasing order (left), the estimated eigenfunctions (middle) and the corresponding cumulative FVE (right) for the ADNI data analysis

6 Conclusions

In this paper, we used a functional varying-coefficient mixed effects model to investigate the dynamic association between a set of covariates of interest and a functional response. An iterative estimation procedure was developed to estimate the varying-coefficient functions in this model for longitudinal or sparse functional data. The proposed procedure not only uses all observations, rather than only the local observations near the target point, but also incorporates the estimated covariance structure into the estimation process to improve the resulting estimators. The asymptotic properties of the proposed estimators, including uniform consistency and asymptotic normality, were also established. Finally, we compared the proposed procedure with the local linear method through several simulation studies; the results show that our method performs better in terms of the root mean squared error. A real data example was also provided to illustrate the approach.

Our future research will focus on several important issues. First, it would be interesting to obtain a more efficient estimator of the covariance function by using the newly proposed estimation procedure. Second, it would be meaningful to apply the proposed estimation procedure to other, more complex regression models, such as functional index models (Jiang and Wang 2011), single-index varying-coefficient (SIVC) models (Luo et al. 2016) and functional additive models (Zhang, Park and Wang 2013). Third, it is also worthwhile to develop statistical inference procedures, such as hypothesis tests, for longitudinal and functional data.