Nonparametric inference for distribution functions with stratified samples

https://doi.org/10.1016/j.jspi.2021.05.001

Abstract

We consider nonparametric estimation of a distribution function when data are collected from two-phase stratified sampling without replacement. We study the inverse probability weighted empirical distribution function and propose a novel computational procedure to construct a confidence band. Two-phase sampling design induces heterogeneity across strata and dependence due to sampling without replacement. Two major statistical challenges from this design are: (1) the standard practice to approximate sampling without replacement by Bernoulli sampling leads to an incorrect coverage probability, and (2) a complicated limiting process of the proposed estimator does not allow one to analytically compute quantiles of the supremum of the limiting process nor to apply existing bootstrap methods to the proposed estimator. To address these issues, we rigorously establish the asymptotic properties of the proposed estimator and develop a simulation-based method to estimate the limiting process. The finite sample performance is evaluated through a simulation study. A Wilms tumor example is provided.

Introduction

We consider nonparametric estimation of a distribution function F of a random variable X when data are collected from two-phase stratified sampling in biomedical studies. Two-phase stratified sampling is a cost-effective design for collecting variables that are expensive to measure, for example in studies of rare diseases and rare exposures. Examples of this design include the stratified case-control design (White, 1986) and stratified case-cohort designs (Prentice, 1986, Borgan et al., 2000). At the first phase, an independent and identically distributed (i.i.d.) sample is collected from an infinite population. Variables available at the first phase are incomplete for statistical inference. In order to collect the missing variable X, the i.i.d. sample is stratified and subsamples are drawn from each stratum without replacement. The resulting final sample is heterogeneous and dependent because of finite population sampling with different selection probabilities across strata. In this paper, we address bias arising from both heterogeneity across strata and dependence within strata, and develop a rigorous large sample theory for inference on the distribution function.
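The two phases described above can be sketched in a few lines. This is only an illustrative simulation, not the author's code: the phase-one variable `z`, the cut points defining the three strata, and the sampling fractions are all hypothetical choices made for the example.

```python
import numpy as np

def two_phase_sample(strata, n_per_stratum, rng):
    """Return a 0/1 phase-two selection indicator for each phase-one unit.

    strata: integer stratum label for each of the N phase-one units.
    n_per_stratum: dict mapping stratum label -> fixed phase-two sample size.
    """
    strata = np.asarray(strata)
    selected = np.zeros(strata.shape[0], dtype=int)
    for j, n_j in n_per_stratum.items():
        members = np.flatnonzero(strata == j)
        # Sampling without replacement: exactly n_j units are drawn from
        # stratum j, so the indicators within a stratum are dependent.
        chosen = rng.choice(members, size=n_j, replace=False)
        selected[chosen] = 1
    return selected

rng = np.random.default_rng(0)
N = 1000
z = rng.normal(size=N)                  # hypothetical phase-one variable
strata = np.digitize(z, [-0.5, 0.5])    # hypothetical strata labelled 0, 1, 2
sizes = {j: int(np.sum(strata == j)) for j in range(3)}
# Hypothetical sampling fractions of roughly 1/3, 1/10 and 1/5.
n_per_stratum = {0: sizes[0] // 3, 1: sizes[1] // 10, 2: sizes[2] // 5}
xi = two_phase_sample(strata, n_per_stratum, rng)
```

Because the number selected in each stratum is fixed in advance, the selection indicators within a stratum sum to a constant, which is the source of the within-stratum dependence discussed in the text.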

Statistical inference with two-phase stratified sampling has focused on censored regression modeling (see e.g. the accelerated failure time model (Nan et al., 2006, Nan et al., 2009), the additive hazards model (Kulich and Lin, 2000), the Cox proportional hazards model (Prentice, 1986, Self and Prentice, 1988), and the transformation model (Lu and Tsiatis, 2006, Kong et al., 2006, Zeng and Lin, 2014)). The standard assumption adopted in this literature is Bernoulli sampling from each stratum rather than sampling without replacement (see e.g. Breslow and Wellner, 2007, Saegusa and Wellner, 2013 for some exceptions). Although the number of selected observations is then random, Bernoulli sampling ensures independence among observations, which simplifies the theoretical analysis of statistical methods. Approximating sampling without replacement by Bernoulli sampling leads to statistically conservative conclusions in practice because the asymptotic variance under Bernoulli sampling is larger than under sampling without replacement (Breslow and Wellner, 2007, Saegusa and Wellner, 2013). By the same reasoning, a confidence band based on Bernoulli sampling, if successfully constructed, would be wider than necessary and its coverage probability would be inflated under sampling without replacement, although, to the best of our knowledge, methods for confidence bands have not been studied under Bernoulli sampling. To achieve the correct coverage probability, we do not adopt the standard practice of assuming Bernoulli sampling, but rigorously quantify uncertainty arising from sampling without replacement.
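The variance comparison can be illustrated for a single stratum using standard finite-population formulas. This is a simplified sketch that conditions on the phase-one sample (treating the stratum as fixed) and therefore ignores the phase-one randomness that the paper accounts for; the function names and the stratum values are hypothetical.

```python
import numpy as np

def var_mean_bernoulli(x, p):
    """Design variance of (1/N) * sum(xi_i * x_i / p), xi_i i.i.d. Bernoulli(p)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    return (1.0 - p) / p * np.sum(x ** 2) / N ** 2

def var_mean_without_replacement(x, n):
    """Design variance of the mean of n units drawn without replacement."""
    x = np.asarray(x, dtype=float)
    N = x.size
    # Finite population correction times S^2 / n, with S^2 using divisor N - 1.
    return (1.0 - n / N) * np.var(x, ddof=1) / n

x = np.arange(1, 101, dtype=float)   # hypothetical fixed stratum of 100 values
n = 20                               # phase-two sample size in this stratum
v_bern = var_mean_bernoulli(x, n / x.size)
v_wor = var_mean_without_replacement(x, n)
print(v_bern, v_wor)
```

For this stratum the Bernoulli variance is several times larger than the without-replacement variance, because the former depends on the second moment of the values while the latter depends only on their spread around the stratum mean.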

Unlike in the biostatistical literature, dependence has been the main research focus in sampling theory. Variance that accounts for dependence has been studied in various sampling designs including the two-phase sampling design (see e.g. Särndal et al., 1992). Note, however, that two-phase sampling in sampling theory (e.g. Särndal et al., 1992, Chen and Rao, 2007) is different from two-phase sampling in the biostatistical literature because, in sampling theory, the first-phase sample is itself drawn from a finite population before the subsequent sampling at the second phase. In contrast, our two-phase stratified sampling obtains an i.i.d. sample from the infinite population at the first phase and, at the second phase, collects subsamples from each stratum after stratifying the i.i.d. sample. The closest design to our setting in sampling theory is thus stratified sampling without replacement, where stratified samples are obtained from the finite population. This is because our sampling design reduces to stratified sampling in sampling theory when one treats the i.i.d. sample at the first phase as a fixed finite population, ignoring the randomness in sampling from the infinite population.

For stratified designs in sampling theory, Bickel and Krieger (1989) adopted two types of bootstrap methods designed for survey sampling to construct confidence bands for a distribution function. These methods, however, do not produce valid confidence bands in our setting because of the key difference in probabilistic frameworks between biostatistics and sampling theory. In the finite population framework adopted by standard sampling theory, the random variable X is treated as a fixed, non-random quantity, so statistical methods in sampling theory disregard the additional randomness due to sampling X from the infinite population. Accordingly, the confidence band of Bickel and Krieger (1989) would be narrower and the corresponding coverage probability would be deflated in our setting, as we show in a simulation study below. In biostatistical applications, scientific phenomena such as disease progression are naturally considered random, so the distribution function F is a more meaningful parameter of interest than the finite population quantities targeted in sampling theory. To account for randomness in X for biostatistical applications, a novel asymptotic theory is required for precise quantification of uncertainty due to both sampling from the infinite population and subsequent sampling from strata.

In this paper, we study the inverse probability weighted empirical distribution function as the proposed nonparametric estimator of the distribution function. Inverse probability weighting is a natural technique that is widely used in sampling theory and in the missing data literature. The key difference from that literature comes from our probabilistic framework, where the variable X is random and sampling from each stratum is without replacement, as described above. To rigorously evaluate the uncertainty of the proposed estimator in our setting, we apply empirical process theory developed for two-phase stratified sampling by Saegusa and Wellner (2013). We show the uniform consistency of the proposed estimator over the real line and its weak convergence to a Gaussian process. The precise limiting distribution obtained from this approach is the basis for achieving the correct coverage probability for the proposed confidence band of the distribution function.
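As a concrete illustration, the following is a minimal sketch of an inverse probability weighted empirical distribution function, assuming the usual weights N_j/n_j (stratum size over phase-two sample size) for a selected unit in stratum j. The function name, arguments, and toy data are hypothetical, and the exact form of the estimator should be checked against the definition in Section 2.

```python
import numpy as np

def ipw_ecdf(x_obs, weights, N, grid):
    """Evaluate an IPW empirical distribution function on a grid.

    x_obs:   values of X for the units selected at phase two.
    weights: inverse selection probabilities N_j / n_j of those units.
    N:       phase-one sample size.
    grid:    points x at which the estimate is evaluated.
    """
    x_obs = np.asarray(x_obs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # indicators[k, i] is True when x_obs[i] <= grid[k].
    indicators = x_obs[None, :] <= grid[:, None]
    return (indicators * weights[None, :]).sum(axis=1) / N

# Toy example: stratum 1 has N_1 = 6, n_1 = 2 (weight 3);
# stratum 2 has N_2 = 4, n_2 = 2 (weight 2); N = 10.
x_obs = [1.0, 3.0, 2.0, 4.0]
weights = [3.0, 3.0, 2.0, 2.0]
F_hat = ipw_ecdf(x_obs, weights, N=10, grid=[0.0, 2.5, 10.0])
print(F_hat)
```

With fixed sample sizes per stratum the weights of the selected units sum to N, so the estimate reaches one at the right end of the grid.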

The main contribution of this paper is a novel computational method to construct a confidence band for the distribution function, whether X is continuous or discrete. The proposed approach is to estimate quantiles of the supremum over the real line of the absolute difference between our estimator and the true function F. In the i.i.d. setting, this approach is applied to the Kolmogorov–Smirnov statistic for a continuous random variable, and its quantiles are analytically obtained from the Kolmogorov–Smirnov distribution (Kolmogorov, 1933, Smirnov, 1944). For non-continuous random variables, the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al., 1956) yields an upper bound on the quantiles of interest, resulting in an inflated coverage probability. An alternative approach is to estimate quantiles by bootstrapping the (weighted) Kolmogorov–Smirnov statistic. This approach was explored in the i.i.d. setting by Bickel and Freedman (1981) and in stratified sampling from a finite population by Bickel and Krieger (1989). These bootstrap methods are valid for both continuous and non-continuous random variables. In our setting, the proposed estimator is a sum of dependent variables and its limiting process is a linear combination of multiple Gaussian processes. Accordingly, the Kolmogorov–Smirnov distribution and the Dvoretzky–Kiefer–Wolfowitz inequality, which assume an i.i.d. sample, do not produce a reasonable estimate of the quantiles. A valid bootstrap method is not available in our setting either, because existing bootstrap methods reproduce randomness due either to sampling from the infinite population or to sampling from strata, but not both at the same time.
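For reference, the i.i.d. baseline based on the Dvoretzky–Kiefer–Wolfowitz inequality can be sketched as follows. The sketch uses the version of the inequality with the tight constant 2 (a refinement due to Massart that the text does not state), giving the half-width sqrt(log(2/α)/(2n)). It applies only to an i.i.d. sample and is not the method proposed in this paper; the function name and data are hypothetical.

```python
import numpy as np

def dkw_band(sample, alpha=0.05):
    """Conservative (1 - alpha) confidence band for F from an i.i.d. sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Distinct values and their counts, so that ties are handled correctly
    # for non-continuous random variables.
    values, counts = np.unique(sample, return_counts=True)
    ecdf = np.cumsum(counts) / n
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(ecdf - eps, 0.0, 1.0)
    upper = np.clip(ecdf + eps, 0.0, 1.0)
    return values, lower, upper, eps

rng = np.random.default_rng(1)
sample = rng.poisson(lam=3.0, size=100)   # hypothetical discrete i.i.d. sample
values, lower, upper, eps = dkw_band(sample, alpha=0.05)
```

The band has the same half-width at every point, which is why it tends to be wider than necessary for discrete data.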

Besides analytical or bootstrap computation of quantiles, various methods for confidence bands have been proposed. For parametric models, confidence bands have been considered for normal distributions (Kanofsky and Srinivasan, 1972), Weibull distributions (Schafer and Angus, 1979), and the location-scale parameter model (Cheng and Iles, 1983). For nonparametric models, a Bayesian approach with the Dirichlet prior was explored by Breth (1978). Inversion of the nonparametric likelihood test of Berk and Jones (1978) was considered by Owen (1995). For continuous random variables, an optimal confidence band was proposed by Frey (2008) based on the narrowness criterion, and a kernel smoothed estimator of a distribution function was adopted by Wang et al. (2013).

The rest of the paper is organized as follows. In Section 2, we formally introduce two-phase stratified sampling, and the limit behavior of the proposed estimator of F is derived. We present the algorithm to compute the confidence band and study its large sample property in Section 3. Our method is extended to conditional distribution functions in Section 4. The performance of the proposed methodology is evaluated through a simulation study in Section 5. A data example from the national Wilms tumor study is presented in Section 6. All proofs are deferred to the Appendix.

Sampling and estimators

Two-phase stratified sampling is formulated as follows. This formulation is the same as the one studied in Breslow and Wellner (2007) and Saegusa and Wellner (2013).

The variable of interest is a random vector W=(X,Y) taking values in a measurable space (W,A). In this paper, we focus on the random variable X and inference on its cumulative distribution function F, but inference on both X and Y (e.g. regression modeling) is of general interest in two-phase stratified designs.

Let V=(W̃,Z)∈𝒱 where W̃

Confidence band

The basic idea to obtain a confidence band is to estimate $q_{1-\alpha}$ such that $P\left(\sup_{x\in\mathbb{R}}\sqrt{N}\,|F_N(x)-F(x)|\le q_{1-\alpha}\right)\to 1-\alpha$ as $N\to\infty$, from which the large sample $100(1-\alpha)\%$ confidence band is obtained as $F_N(x)-q_{1-\alpha}/\sqrt{N}\le F(x)\le F_N(x)+q_{1-\alpha}/\sqrt{N}$ for all $x\in\mathbb{R}$. One promising approach is to analytically compute quantiles of the limit of $\sup_{x\in\mathbb{R}}\sqrt{N}\,|F_N(x)-F(x)|$ as in the i.i.d. setting based on the Kolmogorov–Smirnov distribution, but this approach requires the random variable X to be continuous. Also, a valid resampling method to bootstrap $\sup_{x\in\mathbb{R}}$

Extension to conditional distribution given discrete variables

In this section, we extend our methodology to conditional distribution functions given a discrete variable U whose levels correspond to different risk factors. This extension is not as straightforward as in the i.i.d. setting. There, subsetting the i.i.d. sample by risk factors again yields i.i.d. subsamples, so the same methodology for confidence bands applies to each subsample. In contrast, our stratified samples require care to correct bias due to heterogeneity and biased sampling.

Our

Simulation study

We performed a simulation study to evaluate the finite-sample performance of the proposed confidence band. We consider three different probability distributions for X: a mixture of beta distributions, a mixture of Poisson distributions, and a mixture of normal distributions. In all three cases, three strata were formed and sampling was conducted with the sampling probabilities approximately 0.3, 0.1, and 0.2 from each stratum. Here we determined the sample sizes at the second phase by e.g. n1=N

Application

We apply the proposed method to data from the national Wilms tumor study (D’Angio et al., 1989). Wilms tumor is a rare kidney cancer in children. Predictors of relapse include histology of the cancer, age at diagnosis, and tumor diameter. Data for all 3915 patients are available and were used to compare different designs (Breslow and Chatterjee, 1999, Breslow et al., 2009, Saegusa, 2019). In our analysis, we check if the empirical distributions based on the entire cohort are contained in the

Acknowledgments

This research was funded by the National Science Foundation, United States (DMS 2014971) and the National Institutes of Health, United States (R01AI121259, R56AI140953-01). I would like to thank the editor, the associate editor, and the referee for their constructive comments and suggestions.

References (37)

  • Frey, J. Optimal distribution-free confidence bands for a distribution function. J. Statist. Plann. Inference (2008)
  • Akritas, M.G. Bootstrapping the Kaplan-Meier estimator. J. Amer. Statist. Assoc. (1986)
  • Berk, R.H. et al. Relatively optimal combinations of test statistics. Scand. J. Stat. (1978)
  • Bickel, P.J. et al. Some asymptotic theory for the bootstrap. Ann. Statist. (1981)
  • Bickel, P.J. et al. Asymptotic normality and the bootstrap in stratified sampling. Ann. Statist. (1984)
  • Bickel, P.J. et al. Confidence bands for a distribution function using the bootstrap. J. Amer. Statist. Assoc. (1989)
  • Borgan, Ø. et al. Exposure stratified case-cohort designs. Lifetime Data Anal. (2000)
  • Breslow, N.E. et al. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J. R. Stat. Soc. Ser. C. Appl. Stat. (1999)
  • Breslow, N.E. et al. Using the whole cohort in the analysis of case-cohort data. Amer. J. Epidemiol. (2009)
  • Breslow, N.E. et al. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand. J. Stat. (2007)
  • Breth, M. Bayesian confidence bands for a distribution function. Ann. Statist. (1978)
  • Chen, J. et al. Asymptotic normality under two-phase sampling designs. Statist. Sinica (2007)
  • Cheng, R.C.H. et al. Confidence bands for cumulative distribution functions of continuous random variables. Technometrics (1983)
  • D’Angio, G.J. et al. Treatment of Wilms’ tumor. Results of the third national Wilms’ tumor study. Cancer (1989)
  • Dvoretzky, A. et al. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat. (1956)
  • Giné, E. et al.
  • Gross, S. Median estimation in sample surveys
  • Kanofsky, P. et al. An approach to the construction of parametric confidence bands on cumulative distribution functions. Biometrika (1972)