Published by De Gruyter, August 10, 2020

Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator

  • Weixin Cai and Mark van der Laan

Abstract

The Highly Adaptive least absolute shrinkage and selection operator (LASSO) Targeted Minimum Loss Estimator (HAL-TMLE) is an efficient plug-in estimator of a pathwise differentiable parameter in a statistical model that at minimum (and possibly only) assumes that the sectional variation norms of the true nuisance functions (i.e., the relevant parts of the data distribution) are finite. It relies on an initial estimator (HAL-MLE) of the nuisance functions obtained by minimizing the empirical risk over the parameter space under the constraint that the sectional variation norm of the candidate functions is bounded by a constant, where this constant can be selected with cross-validation. In this article we establish that the nonparametric bootstrap for the HAL-TMLE, fixing the value of the sectional variation norm at a value larger than or equal to the cross-validation selector, provides a consistent method for estimating the normal limit distribution of the HAL-TMLE. In order to optimize the finite sample coverage of the nonparametric bootstrap confidence intervals, we propose a selection method for this sectional variation norm that is based on running the nonparametric bootstrap for all values of the sectional variation norm larger than the one selected by cross-validation, and subsequently determining the value at which the width of the resulting confidence intervals reaches a plateau. We demonstrate our method for 1) nonparametric estimation of the average treatment effect when observing a covariate vector, binary treatment, and outcome, and 2) nonparametric estimation of the integral of the square of the multivariate density of the data distribution. In addition, we present simulation results for these two examples demonstrating the excellent finite sample coverage of the bootstrap-based confidence intervals.

1 Introduction

We consider estimation of a pathwise differentiable real valued target estimand based on observing n independent and identically distributed observations O_1, …, O_n from a data distribution P_0 known to belong to a statistical model M. A target parameter Ψ : M → IR is a mapping that maps a possible data distribution P ∈ M into a real number, while ψ_0 = Ψ(P_0) represents the statistical estimand. The canonical gradient D*(P) of the pathwise derivative of the target parameter at a distribution P defines an asymptotically efficient estimator among the class of regular estimators [1]: an estimator ψ_n is asymptotically efficient at P_0 if and only if it is asymptotically linear at P_0 with influence curve D*(P_0):

ψ_n − ψ_0 = (1/n) Σ_{i=1}^n D*(P_0)(O_i) + o_P(n^{-1/2}).

The target parameter Ψ(P) depends on the data distribution P through a parameter Q = Q(P), while the canonical gradient D*(P) possibly also depends on another nuisance parameter G(P): D*(P) = D*(Q(P), G(P)). Both of these nuisance parameters are chosen so that they can be defined as a minimizer of the expectation of a specific loss function: Q(P) = arg min_{Q ∈ Q(M)} P L_1(Q) and G(P) = arg min_{G ∈ G(M)} P L_2(G), where we use the notation P f ≡ ∫ f(o) dP(o). We consider the case that the parameter spaces Q(M) = {Q(P) : P ∈ M} and G(M) = {G(P) : P ∈ M} for these nuisance parameters Q and G are contained in the set of multivariate cadlag functions with sectional variation norm ∥·∥_v* [2] bounded by a constant (this norm will be defined in the next section).

We consider a targeted minimum loss-based (substitution) estimator Ψ(Q_n*) [3], [4], [5], [6] of the target parameter that uses as initial estimator of the nuisance parameters (Q_0, G_0) the highly adaptive lasso minimum loss-based estimators (HAL-MLEs) (Q_n, G_n), defined by minimizing the empirical mean of the loss over the parameter space [7], [8]. Since the HAL-MLEs converge at a rate faster than n^{-1/2} with respect to (w.r.t.) the loss-based quadratic dissimilarities (to be defined later), which corresponds with a rate faster than n^{-1/4} for estimation of Q_0 and G_0, this HAL-TMLE has been shown to be asymptotically efficient under weak regularity conditions [7]. Statistical inference could therefore be based on the normal limit distribution in which the asymptotic variance is estimated with an estimator of the variance of the canonical gradient. In that case, inference ignores the potentially very large contributions of the higher order remainder, which could, in finite samples, easily dominate the first order empirical mean of the efficient influence curve term when the size of the nuisance parameter spaces is large (e.g., the dimension of the data is large and the model is nonparametric).

In this article we propose the nonparametric bootstrap to obtain a better estimate of the finite sample distribution of the HAL-TMLE than the normal limit distribution. The bootstrap fixes the sectional variation norm at the values used for the HAL-MLEs ( Q n , G n ) on a bootstrap sample. We propose a data adaptive selector of this tuning parameter tailored to obtain improved finite sample coverage for the resulting confidence intervals.

1.1 Organization

In Section 2 we formulate the estimation problem and motivate the challenge for statistical inference. In Section 3 we present the nonparametric bootstrap estimator of the actual sampling distribution of the HAL-TMLE which thus incorporates estimation of its higher order stochastic behavior, and can thereby be expected to outperform the Wald-type confidence intervals. We prove that this nonparametric bootstrap is asymptotically consistent for the optimal normal limit distribution. Our results also prove that the nonparametric bootstrap preserves the asymptotic behavior of the HAL-MLEs of our nuisance parameters Q and G, providing further evidence for good performance of the nonparametric bootstrap. Importantly, our results demonstrate that the approximation error of the nonparametric bootstrap estimate of the true finite sample distribution of the HAL-TMLE is mainly driven by the approximation error of the nonparametric bootstrap for estimating the finite sample distribution of a well behaved empirical process. In Section 4 we present a plateau selection method for selecting the fixed sectional variation norm in the nonparametric bootstrap and a bias-correction in order to obtain improved finite sample coverage for the resulting confidence intervals.

In Section 5 we demonstrate our methods for two examples involving a nonparametric model and a specified target parameter: average treatment effect and integral of the square of the data density. In Section 6 we carry out a simulation study to demonstrate the practical performance of our proposed nonparametric bootstrap based confidence intervals w.r.t. their finite sample coverage. We conclude with a discussion in Section 7. Proofs of our Lemma and Theorems have been deferred to the Appendix. We refer to our accompanying technical report for additional bootstrap methods and results based on applying the nonparametric bootstrap to an exact second order expansion of the HAL-TMLE, and to various upper bounds of this exact second order expansion.

2 General formulation of statistical estimation problem and motivation for finite sample inference

2.1 Statistical model and target parameter

Let O_1, …, O_n be n i.i.d. copies of a random variable O ∼ P_0 ∈ M. Let P_n be the empirical probability measure of O_1, …, O_n. Let Ψ : M → IR be a real valued parameter that is pathwise differentiable at each P ∈ M with canonical gradient D*(P). That is, given a collection of one dimensional submodels {P_ε^S : ε} ⊂ M through P at ε = 0 with score S, for each of these submodels the derivative (d/dε) Ψ(P_ε^S)|_{ε=0} can be represented as a covariance E_P D(P)(O) S(O) of a gradient D(P) with the score S. The latter is an inner product of a gradient D(P) ∈ L^2_0(P) with the score S in the Hilbert space L^2_0(P) of functions of O with mean zero (under P) endowed with inner product ⟨S_1, S_2⟩_P = P S_1 S_2. Let ∥f∥_P ≡ (∫ f(o)^2 dP(o))^{1/2} be the Hilbert space norm. Such an element D(P) ∈ L^2_0(P) is called a gradient of the pathwise derivative of Ψ at P. The canonical gradient D*(P) is the unique gradient that is an element of the tangent space defined as the closure of the linear span of the collection of scores generated by this family of submodels. Define the exact second-order remainder

(1) R_2(P, P_0) ≡ Ψ(P) − Ψ(P_0) + P_0 D*(P),

where P_0 D*(P) = −(P − P_0) D*(P), since D*(P) has mean zero under P.

Example 1:

(Treatment-specific mean)

Let O = (W, A, Y) ∼ P_0 ∈ M, where A ∈ {0, 1} is a binary treatment, Y ∈ {0, 1} is a binary outcome, and M is a nonparametric model. For a possible data distribution P, let Q̄(P) = E_P(Y | A, W) be the outcome regression, G(P) = P(A = 1 | W) be the propensity score, and let Q_W(P) be the probability distribution of W. The treatment-specific mean parameter is defined by Ψ(P) = E_P E_P(Y | A = 1, W). Let Q = (Q̄, Q_W) and note that the data distribution P is determined by (Q, G). The canonical gradient of Ψ at P is

D*(P) = D*(Q, G) = (I(A = 1) / G(A | W)) (Y − Q̄(A, W)) + Q̄(1, W) − Ψ(Q).

The second-order remainder R_2(P, P_0) ≡ Ψ(P) − Ψ(P_0) + P_0 D*(P) is given by:

R_2(Q, G, Q_0, G_0) = ∫ ((G − G_0)(w) / G(w)) (Q̄ − Q̄_0)(1, w) dP_0(w).
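To make the roles of the canonical gradient and the exact remainder concrete, the following is a minimal numerical sketch (Python, with hypothetical function names, not code from the paper) that checks the identity Ψ(P) − Ψ(P_0) + P_0 D*(P) = R_2(Q, G, Q_0, G_0) for this example by Monte Carlo, taking the covariate distribution in both P and P_0 equal to Q_{W,0} so that only (Q̄, G) differ from the truth:

```python
import numpy as np

def ate_remainder_check(W0, Qbar, Qbar0, G, G0):
    """Monte Carlo check of Psi(P) - Psi(P_0) + P_0 D*(P) = R_2, with W0 ~ Q_{W,0}."""
    psi = np.mean(Qbar(1, W0))    # Psi(Q), with Q_W taken equal to Q_{W,0}
    psi0 = np.mean(Qbar0(1, W0))  # Psi(Q_0)
    # P_0 D*(Q, G): average out A ~ G_0(. | W) and Y with conditional mean Qbar_0(A, W)
    p0_eic = np.mean(G0(W0) / G(W0) * (Qbar0(1, W0) - Qbar(1, W0))) + np.mean(Qbar(1, W0)) - psi
    r2 = np.mean((G(W0) - G0(W0)) / G(W0) * (Qbar(1, W0) - Qbar0(1, W0)))
    return psi - psi0 + p0_eic, r2   # the two returned numbers agree

rng = np.random.default_rng(0)
W0 = rng.uniform(size=100_000)       # W ~ Uniform(0, 1) as the true covariate distribution
print(ate_remainder_check(W0,
                          Qbar=lambda a, w: 0.2 + 0.5 * w, Qbar0=lambda a, w: 0.3 + 0.4 * w,
                          G=lambda w: 0.4 + 0.2 * w, G0=lambda w: 0.5 * np.ones_like(w)))
```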

Let Q : M Q ( M ) be a function valued parameter so that Ψ ( P ) = Ψ 1 ( Q ( P ) ) for some Ψ 1 . For notational convenience, we will abuse notation by referring to the target parameter with Ψ ( Q ) and Ψ ( P ) interchangeably. Let G : M G ( M ) be a function valued parameter so that D * ( P ) = D 1 * ( Q ( P ) , G ( P ) ) for some D 1 * . Again, we will use the notation D * ( P ) and D * ( Q , G ) interchangeably.

For each Q Q ( M ) , let L 1 ( Q ) be a function of O so that

Q_0 = arg min_{Q ∈ Q(M)} P_0 L_1(Q).

Similarly, for each G G ( M ) , let L 2 ( G ) be a function of O so that

G_0 = arg min_{G ∈ G(M)} P_0 L_2(G).

We refer to L_1(Q) and L_2(G) as loss functions for Q_0 and G_0. Let d_01(Q, Q_0) = P_0 L_1(Q) − P_0 L_1(Q_0) ≥ 0 and d_02(G, G_0) = P_0 L_2(G) − P_0 L_2(G_0) ≥ 0 be the loss-based dissimilarities for these two nuisance functions. The loss-based dissimilarity is often called the regret. Assume that the loss functions are uniformly bounded in the sense that sup_{Q ∈ Q(M), O} |L_1(Q)(O)| < ∞ and sup_{G ∈ G(M), O} |L_2(G)(O)| < ∞. In addition, assume

(2) sup_{Q ∈ Q(M)} P_0{L_1(Q) − L_1(Q_0)}^2 / d_01(Q, Q_0) < ∞,  sup_{G ∈ G(M)} P_0{L_2(G) − L_2(G_0)}^2 / d_02(G, G_0) < ∞.

This condition holds for most common bounded loss functions (such as mean-squared error loss and cross entropy loss), and it guarantees that the loss-based dissimilarities d 01 ( Q , Q 0 ) and d 02 ( G , G 0 ) behave as a square of an L 2 ( P 0 ) -norm. These two universal bounds on the loss function yield the oracle inequality for the cross-validation selector among a set of candidate estimators [9], [10], [11], [12], [13]. In particular, it establishes that the cross-validation selector is asymptotically equivalent to the oracle selector.

Example 2:

(Treatment-specific mean)

For the treatment-specific mean parameter, the Q̄ function is the outcome regression E(Y | A, W), and G = P(A = 1 | W) is the propensity score. The other component Q_W of Q will be estimated with the empirical probability measure, which is an NPMLE, so that a TMLE will not update this estimator. Let L_1(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))} be the negative log-likelihood loss for the outcome regression. Similarly, L_2(G) is the negative log-likelihood loss for the propensity score. When, for some δ > 0, G > δ > 0 and δ < Q̄ < 1 − δ, the loss functions are uniformly bounded and the universal bounds (2) are finite.

2.1.1 Donsker class condition

Our formal theorems need to assume that {L_1(Q) : Q ∈ Q(M)}, {L_2(G) : G ∈ G(M)}, and {D*(Q, G) : Q ∈ Q(M), G ∈ G(M)} are uniform (in P ∈ M) Donsker classes, or, equivalently, that the union F of these classes is a uniform Donsker class. We will also assume that sup_{f ∈ F} ∥f∥_∞ < ∞ (we already assumed this above for the loss functions) so that we avoid having to deal with unbounded random variables. We remind the reader that a covering number N(ε, F, L^2(Λ)) is defined as the minimal number of balls of size ε w.r.t. the L^2(Λ)-norm needed to cover the set F of functions embedded in L^2(Λ). Let α ∈ (0, 1) be defined such that

(3) sup_Λ log^{1/2} N(ε, F, L^2(Λ)) = O(ε^{−(1−α)}).

Our formal results will refer to a rate of convergence of the HAL-MLEs w.r.t. the loss-based dissimilarity given by n^{-1/2-α/4}, implied by this index α [7]. In this article we will focus on the following special Donsker class of cadlag functions with a universal bound on the sectional variation norm, in which case α can be chosen as 2/(d + 2).

2.1.2 Loss functions and canonical gradient have a uniformly bounded sectional variation norm

We assume that the loss functions and canonical gradient are cadlag functions with a universal bound on the sectional variation norm. The latter class of functions is indeed a uniform Donsker class. In the sequel we will assume this, but we remark here that throughout we could have replaced this class of cadlag functions with a universal bound on the sectional variation norm by any other uniform Donsker class satisfying (3) above. Below we will present a particular class of models M in which we assume that the nuisance parameters Q and G themselves fall in such classes of functions, so that generally also L 1 ( Q ) , L 2 ( G ) and D * ( Q , G ) will fall in this class. All our applications have been covered by the latter type of models.

We will formalize this condition now. Suppose that O [ 0 , τ ] IR 0 d is a d-variate random variable with support contained in a d-dimensional cube [ 0 , τ ] . Let D d [ 0 , τ ] be the Banach space of d-variate real valued cadlag functions endowed with a supremum norm [14]. Let L 1 : Q ( M ) D d [ 0 , τ ] and L 2 : G ( M ) D d [ 0 , τ ] . We assume that these loss functions and the canonical gradient map into functions in D d [ 0 , τ ] with a sectional variation norm bounded by some universal finite constant (we will define sectional variation norm . v * momentarily)

(4) M_1 ≡ sup_{P ∈ M} ∥L_1(Q(P))∥_v* < ∞,  M_2 ≡ sup_{P ∈ M} ∥L_2(G(P))∥_v* < ∞,  M_3 ≡ sup_{P ∈ M} ∥D*(P)∥_v* < ∞.

Thus, we define F as the union of { D * ( P ) : P M } , { L 1 ( Q ( P ) ) : P M } and { L 2 ( G ( P ) ) : P M } , which is contained in the class of cadlag functions in D d [ 0 , τ ] with a universal bound on the sectional variation norm.

Example 3:

(Treatment-specific mean)

Under the previously stated assumptions, the sectional variation norm of W → D*(Q, G)(W, a, y) (for each (a, y) ∈ {0, 1}^2) can be bounded in terms of the sectional variation norms of W → Q̄(1, W) and G. Similarly, the same statement applies to L_1(Q̄) and L_2(G). As a consequence, the universal bounds (4) are finite.

For a given function F D d [ 0 , τ ] , we define the sectional variation norm as follows. For a given subset s { 1 , , d } , let F s ( x s ) = F ( x s , 0 s ) be the s-specific section of F that sets the coordinates outside the subset s equal to 0, where we used the notation ( x s , 0 s ) for the vector whose j-th component equals x j if j s and 0 otherwise. The sectional variation norm is now defined by

∥F∥_v* = |F(0)| + Σ_{s ⊂ {1,…,d}} ∫_{(0_s, τ_s]} |dF_s(u_s)|,

where the sum is over all subsets s of { 1 , , d } . Note that ( 0 s , τ s ] | d F s ( u s ) | is the standard variation norm of the measure d F s generated by its s-specific section F s on the | s | -dimensional edge ( 0 s , τ s ] × { 0 s } of the d-dimensional cube [ 0 , τ ] . Thus, the sectional variation norm of F is the sum of the variation norms of F itself and of all its s-specific sections F s , plus that of the offset | F ( 0 ) | . We also note that any function F D d [ 0 , τ ] with finite sectional variation norm (i.e., | | F | | v * < ) can be represented as follows [2]:

(5) F(x) = F(0) + Σ_{s ⊂ {1,…,d}} ∫_{(0_s, x_s]} dF_s(u_s).

As utilized in [7] to define the HAL-MLE, since ∫_{(0_s, x_s]} dF_s(u_s) = ∫ I(u_s ≤ x_s) dF_s(u_s), this representation shows that F can be written as an infinitesimal linear combination of tensor product (over s) indicator basis functions x → I(u_s ≤ x_s), indexed by a cut-off u_s, across all subsets s, where the coefficients in front of the tensor product indicator basis functions are equal to the infinitesimal increments dF_s(u_s) of F_s at u_s. This shows that this class of functions can be represented as a “convex” hull of the class of indicator basis functions, which proves that our class F is indeed a Donsker class [15], and it follows that its covering number satisfies (3) with α = 2/(d + 2).
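For a discrete measure (the case produced by the HAL-MLE below), the representation (5) reduces to a finite linear combination of indicator basis functions, and ∥F∥_v* reduces to |F(0)| plus the sum of the absolute coefficients. A small Python sketch, with hypothetical names:

```python
import numpy as np

def indicator_basis(x, s, u):
    # Tensor-product indicator phi_{s,u}(x) = I(u_j <= x_j for all j in s).
    return float(all(x[j] >= u[j] for j in s))

def evaluate_F(x, intercept, terms):
    # F(x) = F(0) + sum over (s, u, beta) of beta * I(u_s <= x_s).
    return intercept + sum(beta * indicator_basis(x, s, u) for (s, u, beta) in terms)

def sectional_variation_norm(intercept, terms):
    # For a discrete measure, ||F||_{v*} = |F(0)| + sum of |beta| over all basis terms.
    return abs(intercept) + sum(abs(beta) for (_, _, beta) in terms)

# Example with d = 2: two main-effect sections and one interaction section.
terms = [((0,), {0: 0.3}, 1.2),            # s = {1}, knot u_1 = 0.3
         ((1,), {1: 0.5}, -0.7),           # s = {2}, knot u_2 = 0.5
         ((0, 1), {0: 0.3, 1: 0.5}, 0.4)]  # s = {1, 2}
print(evaluate_F(np.array([0.6, 0.6]), 0.1, terms))  # 0.1 + 1.2 - 0.7 + 0.4 = 1.0
print(sectional_variation_norm(0.1, terms))          # 0.1 + 1.2 + 0.7 + 0.4 = 2.4
```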

We could slightly enlarge this class as follows. Define the modified sectional variation norm in which |F(0)| does not contribute to the value:

Σ_{s ⊂ {1,…,d}} ∫_{(0_s, τ_s]} |dF_s(u_s)|.

We can enlarge the functional class F to cadlag functions with finite modified sectional variation norm. This class H can be represented as a shift of F by an unbounded scalar: H = {a + F : F ∈ F, a ∈ IR}. As shown in Appendix G, the class H has essentially the same covering number as F, so that we could also select it as our Donsker class F. In general, throughout this article, one could replace ∥·∥_v* by this modified norm. Since we assumed that the supremum norm of all functions in F is bounded by a universal constant, this does not weaken the condition, but, in future work, this observation could be used to extend our work to unbounded functions, in harmony with the cross-validation results [10] for unbounded loss functions.

For discrete measures F s this integral becomes a finite linear combination of such | s | -way indicator basis functions (where | s | denotes the size of the set s). One could think of this representation of F as a saturated model of a function F in terms of tensor products of univariate indicator basis functions, ranging from products over singletons to product over the full set { 1 , , d } . For a function f D d [ 0 , τ ] , we also define the supremum norm | | f | | = sup x [ 0 , τ ] | f ( x ) | .

2.1.3 General class of models for which parameter spaces for Q and G are Cartesian products of sets of cadlag functions with bounds on sectional variation norm

Although the above bounds M 1 , M 2 , M 3 are the only relevant bounds for the asymptotic performance of the HAL-MLE and HAL-TMLE, for practical formulation of a model M one might prefer to state the sectional variation norm restrictions on the parameters Q and G themselves instead of on L 1 ( Q ) and L 2 ( G ) . (In our formal results we will refer to such a model M as having the extra structure (6) defined below, but, this extra structure is not needed, just as we can work with a general Donsker class as mentioned above.)

For that purpose, a model may assume that Q = (Q_1, …, Q_{K_1}) for variation independent parameters Q_k that are themselves m_{1k}-dimensional cadlag functions on [0, τ_{1k}] ⊂ IR^{m_{1k}}_{≥0} with sectional variation norm bounded by some upper bound C^u_{Qk} and lower bound C^l_{Qk}, k = 1, …, K_1, and similarly for G = (G_1, …, G_{K_2}) with sectional variation norm bounds C^u_{Gk} and C^l_{Gk}, k = 1, …, K_2. We say that two parameters Q_1 and Q_2 are variation independent if {(Q_1(P), Q_2(P)) : P ∈ M} = {Q_1(P) : P ∈ M} × {Q_2(P) : P ∈ M} (i.e., the Cartesian product of the parameter spaces of Q_1 and Q_2). Typically, such a model would not enforce a lower bound on the sectional variation norm, so that C^l_{Qk} = C^l_{Gk} = 0. Let C^u_Q = (C^u_{Qk} : k = 1, …, K_1), C^l_Q = (C^l_{Qk} : k = 1, …, K_1), and C_Q = (C^l_Q, C^u_Q), and similarly define C^u_G, C^l_G and C_G = (C^l_G, C^u_G). Specifically, for such a class of models let

F Q k Q k ( M ) ,

F G k G k ( M ) ,

denote the parameter spaces for Q k and G k , and assume that these parameter spaces F j k are contained in the class F j k n p of m j k -variate cadlag functions with sectional variation norm bounded from above by C j k u and from below by C j k l , k = 1 , , K j , j { Q , G } . These bounds C Q u = ( C Q k u : k ) and C G u = ( C G k u : k ) will then imply bounds M 1 , M 2 , M 3 . For such a model L 1 ( Q ) and L 2 ( G ) would be defined as sums of loss functions: L 1 ( Q ) = k = 1 K 1 L 1 k ( Q k ) and L 2 ( G ) = k = 1 K 2 L 2 k ( G k ) . We also define the vector losses L 1 ( Q ) = ( L 1 k ( Q k ) : k = 1 , , K 1 ) , L 2 ( G ) = ( L 2 k ( G k ) : k = 1 , , K 2 ) , and corresponding vector dissimilarities d 01 ( Q , Q 0 ) = ( d 01 , k ( Q k , Q k 0 ) : k = 1 , , K 1 ) and d 02 ( G , G 0 ) = ( d 02 , k ( G k , G k 0 ) : k = 1 , , K 2 ) .

For example, the parameter space F j k of Q k ( j = Q ) or G k ( j = G ) may be defined as

(6) F^{np}_{jk, A_{jk}} ≡ {F ∈ F^{np}_{jk} : dF_s(u_s) = I((s, u_s) ∈ A_{jk}) dF_s(u_s), s ⊂ {1, …, m_{jk}}},

for some set A j k of possible values for ( s , u s ) , k = 1 , , K j , j { Q , G } , where one evaluates this restriction on F in terms of the representation (5). Note that we used short-hand notation g ( x ) = I x A g ( x ) for g being zero for x A . We will make the convention that if A excludes { 0 } , then it corresponds with assuming F ( 0 ) = 0 .

The subset F Q k , A Q k n p of cadlag functions F Q k n p with sectional variation norm between C Q k l and C Q k u further restricts the support of these functions to a set A Q k . For example, A Q k might set d F s = 0 for subsets s of size larger than 3 for all values u s ( 0 s , τ s ] , in which case the model assumes that the nuisance parameter Q k can be represented as a sum over all subsets s of size 1,2 and 3 of a function of the variables indicated by s.

In order to allow modeling of monotonicity (e.g., the nuisance parameter Q_k is an actual cumulative distribution function), we also allow that this set restricts dF_s(u_s) ≥ 0 for all (s, u_s) ∈ A_{jk}. We will denote the latter parameter space with

(7) F^{np,+}_{jk, A_{jk}} = {F ∈ F^{np}_{jk} : dF_s(u_s) = I((s, u_s) ∈ A_{jk}) dF_s(u_s), dF_s ≥ 0, F(0) ≥ 0, ∀ s}.

For the parameter space (7) of monotone functions we allow that the sectional variation norm is known by setting C j k u = C j k l (e.g., for the class of cumulative distribution functions we would have C j k u = C j k l = 1 ), while for the parameter space (6) of cadlag functions with sectional variation norm between C j k l and C j k u we assume C j k l < C j k u .

For the analysis of our proposed nonparametric bootstrap sampling distributions we do not assume this extra model structure that F_{jk} = F^{np}_{jk, A_{jk}} or F_{jk} = F^{np,+}_{jk, A_{jk}} for some set A_{jk}, k = 1, …, K_j, j ∈ {Q, G}. In the sequel we will refer to a model with this extra structure as a model satisfying (6), even though we include the case (7). All our formal results apply without this extra model structure (and also for any other uniform Donsker class, as mentioned above), but it happens to represent a natural model structure for establishing the sectional variation norm bounds (4) on L_1(Q), L_2(G), and D*(Q, G), and for computing HAL-MLEs. The key practical benefit of this extra model structure is that the implementation of the HAL-MLE for such a parameter space F^{np}_{jk, A_{jk}} corresponds with fitting a linear combination of indicator basis functions of the form I(u_s ≤ x_s) (indexed by a subset s and knot-point u_s) under the sole constraint that the sum of the absolute values of the coefficients is bounded between C^l_{jk} and C^u_{jk}, and possibly that the coefficients are non-negative, where the set A_{jk} determines the set of indicator basis functions that are included. Specifically, in the case that the nuisance parameter is a conditional mean or conditional probability, we can compute the HAL-MLE with standard lasso linear or logistic regression software [8]. Therefore, this restriction on our set of models also allows straightforward computation of its HAL-MLEs, the corresponding HAL-TMLE, and their bootstrap analogs.
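As a concrete illustration of this reduction (a sketch only; the paper itself relies on standard lasso linear or logistic regression software [8]), one can build the zero-order tensor-product indicator basis from the observed knot points and fit an L_1-penalized logistic regression, here with scikit-learn, where the penalty plays the role of the bound C^u_{jk}:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def hal_design(X, max_degree=None):
    """Zero-order HAL design: one column I(x_s >= u_s) per subset s of coordinates and
    per observed knot point u_s, so roughly n * 2^d columns in the saturated case."""
    n, d = X.shape
    max_degree = d if max_degree is None else max_degree
    cols, labels = [], []
    for size in range(1, max_degree + 1):
        for s in combinations(range(d), size):
            knots = np.unique(X[:, list(s)], axis=0)
            for u in knots:
                cols.append(np.all(X[:, list(s)] >= u, axis=1).astype(float))
                labels.append((s, tuple(u)))
    return np.column_stack(cols), labels

def fit_hal_logistic(X, y, C=1.0):
    """L1-penalized logistic regression over the HAL basis; a smaller C (scikit-learn's
    inverse penalty) corresponds with a smaller sectional variation norm bound."""
    H, labels = hal_design(X)
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(H, y)
    return fit, H, labels
```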

A typical statistical model assuming the extra structure (6) would be of the form M = { P : Q k 1 ( P ) F Q k 1 , A Q k 1 n p , G k 2 ( P ) F G k 2 , A G k 2 n p , k 1 , k 2 } indexed by the support sets ( ( A Q k 1 , A G k 2 ) : k 1 , k 2 ) and the sectional variation norm bounds ( ( C j k l , C j k u ) : j , k ) , but the model M might include additional restrictions on P as long as the parameter spaces of these nuisance parameters equal these sets F j k j , A j k j n p or F j k j , A j k j n p , + .

Remark 2.1

(Creating parameter spaces of type (6) or (7)) In our first example we have a nuisance parameter G(W) = E_P(A | W) that is not just assumed to be cadlag and to have bounded sectional variation norm, but is also bounded between δ and 1 − δ for some δ > 0. This means that the parameter space for this G is not exactly of type (6). This is easily resolved by, for example, reparameterizing G(W) = expit(G̃(W)), where G̃ can be any cadlag function with sectional variation norm bounded by some constant C^u. The bound C^u automatically implies a supremum norm bound on G̃, and thereby that δ < G < 1 − δ for some δ = δ(C^u) > 0. One now defines the nuisance parameter as G̃. Similarly, such a parametrization can be applied to E(Y | A, W) and to the density in our second example. These just represent a few examples showcasing that one can reparametrize the natural nuisance parameters in terms of nuisance parameters that have a parameter space of the form (6) or (7). These representations are actually natural steps for the implementation of the HAL-MLE, since they allow us to minimize the empirical risk over a generalized linear model with the sole constraint that the sum of the absolute values of the coefficients is bounded (and possibly that the coefficients are non-negative).

2.1.4 Bounding the exact second-order remainder in terms of loss-based dissimilarities

Let

R 2 ( P , P 0 ) = R 20 ( Q , G , Q 0 , G 0 )

for some mapping R 20 ( ) = R 2 P 0 ( ) possibly indexed by P 0 . We assume the following upper bound:

(8) |R_2(P, P_0)| = |R_20(Q, G, Q_0, G_0)| ≤ f(d^{1/2}_{01}(Q, Q_0), d^{1/2}_{02}(G, G_0))

for some function f : IR^K_{≥0} → IR_{≥0}, K = K_1 + K_2, of the form f(x) = Σ_{i,j} a_{ij} x_i x_j, a quadratic polynomial with positive coefficients a_{ij} ≥ 0. In all our examples, one simply uses the Cauchy–Schwarz inequality to bound R_20(P, P_0) in terms of the L^2(P_0)-norms of Q_{k_1} − Q_{k_1,0} and G_{k_2} − G_{k_2,0}, and subsequently one relates these L^2(P_0)-norms to the loss-based dissimilarities d_{01,k_1}(Q_{k_1}, Q_{k_1,0}) and d_{02,k_2}(G_{k_2}, G_{k_2,0}), respectively. This bounding step will also rely on an assumption that denominators in R_20(P, P_0) are uniformly bounded away from zero. This type of assumption, which guarantees uniform bounds on D*(Q, G) and on R_20(Q, G, Q_0, G_0), is often referred to as a strong positivity assumption since it requires that the data density has a certain type of support relevant for the target parameter Ψ, and that the data density is uniformly bounded away from zero on that support. In the treatment-specific mean example, a common case where the strong positivity assumption does not hold is if G_0(A = 1 | W) = 0 for some value of W.
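For instance, in the treatment-specific mean example the Cauchy–Schwarz step takes the following schematic form (a sketch, using the strong positivity bound G > δ and condition (2) to pass from L^2(P_0)-norms to loss-based dissimilarities):

\begin{align*}
|R_{20}(Q,G,Q_0,G_0)|
  &= \left| \int \frac{(G-G_0)(w)}{G(w)}\,(\bar Q-\bar Q_0)(1,w)\,dP_0(w) \right| \\
  &\le \frac{1}{\delta}\,\| G-G_0 \|_{P_0}\, \| \bar Q(1,\cdot)-\bar Q_0(1,\cdot) \|_{P_0}
  \;\lesssim\; \frac{1}{\delta}\, d_{02}^{1/2}(G,G_0)\, d_{01}^{1/2}(\bar Q,\bar Q_0),
\end{align*}

so that (8) holds with a quadratic f(x) = a x_1 x_2 whose coefficient depends on δ.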

2.1.5 Continuity of efficient influence curve as function of P at P 0

We also assume that the rates of convergence of d_01(Q_n, Q_0) and d_02(G_n, G_0) translate into the same rate of convergence for P_0{D*(Q_n, G_n) − D*(Q_0, G_0)}^2. This is guaranteed by the following upper bound:

(9) P_0{D*(Q, G) − D*(Q_0, G_0)}^2 ≤ f(d^{1/2}_{01}(Q, Q_0), d^{1/2}_{02}(G, G_0))

for some function f : IR^K_{≥0} → IR_{≥0}, K = K_1 + K_2, of the form f(x) = Σ_{i,j} a_{ij} x_i x_j, a quadratic polynomial with positive coefficients a_{ij} ≥ 0.

2.2 HAL-MLEs of nuisance parameters

We estimate Q 0 , G 0 with HAL-MLEs Q n , G n satisfying (with probability tending to 1)

P_n L_1(Q_n) ≤ P_n L_1(Q_0),

P_n L_2(G_n) ≤ P_n L_2(G_0).

For example, Q n might be defined as the actual minimizer Q n = arg min Q Q ( M ) P n L 1 ( Q ) . If Q has multiple components and the loss function is a corresponding sum loss function, then these HAL-MLEs correspond with separate HAL-MLEs for each component. We have the following previously established result from Lemma 3 in van der Laan [7] for these HAL-MLEs. We represent estimators as mappings on the nonparametric model M n p containing all possible realizations of the empirical measure P n .

Lemma 1

(Lemma 3 from van der Laan [7]) Let O ∼ P_0 ∈ M. Let Q : M → Q(M) be a function valued parameter and let L : Q(M) → D^d[0, τ] be a loss function so that Q_0 ≡ Q(P_0) = arg min_{Q ∈ Q(M)} P_0 L(Q). Let Q̂ : M^{np} → Q(M) define an estimator Q_n ≡ Q̂(P_n) so that P_n L(Q_n) = min_{Q ∈ Q(M)} P_n L(Q) or P_n L(Q_n) ≤ P_n L(Q_0). Let d_0(Q, Q_0) = P_0 L(Q) − P_0 L(Q_0) be the loss-based dissimilarity. Then

d_0(Q_n, Q_0) ≤ −(P_n − P_0){L(Q_n) − L(Q_0)}.

If sup_{Q ∈ Q(M)} ∥L(Q)∥_v* < ∞, and (2) holds for L_1(Q), then

E_0 d_0(Q_n, Q_0) = O(n^{-1/2-α/4}),

where α is defined as in (3) for class { L 1 ( Q ) : Q Q ( M ) } .

Application of this general lemma proves that d_01(Q_n, Q_0) = O_P(n^{-1/2-α/4}) and d_02(G_n, G_0) = O_P(n^{-1/2-α/4}).

One can add restrictions to the parameter space Q ( M ) over which one minimizes in the definition of Q n and G n as long as one guarantees that, with probability tending to 1, P n L 1 ( Q n ) P n L 1 ( Q 0 ) and P n L 2 ( G n ) P n L 2 ( G 0 ) . For example, in a model M with extra structure (6) this allows one to use a data dependent upper bound C Q n u C 1 u on the sectional variation norm in the definition of Q n if we know that C Q n u will be larger than the true C Q 0 u = | | Q 0 | | v * with probability tending to 1.

2.3 HAL-TMLE

Consider a finite dimensional local least favorable model { Q n , ϵ : ϵ } Q ( M ) through Q n at ϵ = 0 so that the linear span of the components of d d ϵ L 1 ( Q n , ϵ ) at ϵ = 0 includes D * ( Q n , G n ) . Let Q n * = Q n , ϵ n for ϵ n = arg min ϵ P n L 1 ( Q n , ϵ ) . We assume that this one-step TMLE Q n * already satisfies

(10) r_n ≡ −P_n D*(Q_n*, G_n) = o_P(n^{-1/2}).

Since d_01(Q_n, Q_0) = o_P(n^{-1/2}), we will have ε_n = o_P(n^{-1/4}), and ε_n solves its score equation (d/dε) P_n L_1(Q_{n,ε})|_{ε=ε_n} = 0, which, to first order, equals the efficient influence curve equation P_n D*(Q_{n,ε_n}, G_n) = 0 (with a second order remainder O(ε_n^2) = o_P(n^{-1/2})). This basic argument allows one to prove that (10) holds under the assumption d_01(Q_n, Q_0) = o_P(n^{-1/2}) and regularity conditions, as formally shown in the Appendix of [7]. Alternatively, one could use the one-dimensional canonical universal least favorable model satisfying (d/dε) L_1(Q_{n,ε}) = D*(Q_{n,ε}, G_n) at each ε (see our second example in Section 5). In that case, the efficient influence curve equation (10) is solved exactly by the one-step TMLE, i.e., r_n = 0 [16]. The HAL-TMLE of ψ_0 is the plug-in estimator ψ_n* = Ψ(Q_n*). In the context of model structure (6) (or (7)), we will also refer to this estimator as the HAL-TMLE(C^u) to indicate its dependence on the specification of the bounds C^u = (C^u_Q, C^u_G) on the sectional variation norms of the components of Q and G.

Lemma 2 in Appendix A proves that d 01 ( Q n , ϵ n , Q 0 ) converges at the same rate as d 01 ( Q n , Q 0 ) = O P ( n 1 / 2 α / 4 ) (see (22)). This also implies this result for any K-th step TMLE with K fixed. The advantage of a one-step or K-th step TMLE is that it is always well defined, and it easily follows that it converges at the same rate as the initial Q n to Q 0 . In addition, for these closed form TMLEs it is also guaranteed that the sectional variation norm of Q n * remains universally bounded. The latter is important for the Donsker class condition for asymptotic efficiency of the HAL-TMLE, but the Donsker class condition could be avoided by using a cross-validated HAL-TMLE that relies on sample splitting [5].

Assuming the extra model structure (6), since we apply the least favorable submodel to an HAL-MLE Q_n that likely has the maximal allowed sectional variation norm C_1^u, the following remark is in order. We suggest simply extending the statistical model by enlarging the sectional variation norm bounds to C_1^u + δ for some δ > 0, even though the original bounds C_1^u are still used in the definition of the HAL-MLEs. This enlargement of the statistical model does not change the canonical gradient at P_0 (known to be an element of the interior of the original model), while a least favorable submodel through the HAL-MLE is now allowed to enlarge the sectional variation norm. This makes the construction of a least favorable submodel easier by not having to worry about constraining the sectional variation norm. Since the HAL-MLE Q_n has the maximal allowed uniform sectional variation norm C_Q^u, and Q_n is consistent, the sectional variation norm of the TMLE Q_n* = Q_{n,ε_n} will now be slightly larger, and asymptotically approximate C_1^u. Either way, with the slightly enlarged definition of M, we have {Q_{n,ε} : ε} ⊂ M, so that assumption (4) guarantees that ∥L_1(Q_{n,ε_n})∥_v* is bounded by a universal constant.

Example 4:

(Treatment-specific mean)

Condition (8) holds by applying the Cauchy–Schwarz inequality and using G > δ > 0 for some δ > 0. The HAL-MLEs Q̄_n and G_n of Q̄ and G, respectively, can be computed with a lasso-logistic regression estimator with a large (approximately n 2^d) number of indicator basis functions (see our example section for more details), where we can select the L_1-norm of the coefficient vector with cross-validation. The least favorable submodel through Q̄_n is given by

(11) logit Q̄_{n,ε} = logit Q̄_n + ε C(G_n),

where C(G_n)(A, W) ≡ A / G_n(W). Let ε_n ≡ arg min_ε P_n L_1(Q̄_{n,ε}), which is thus computed with a simple univariate logistic regression MLE, using logit Q̄_n as an offset. This defines the TMLE Q̄_n* = Q̄_{n,ε_n}. Recall that Q_{W,n} is already an NPMLE, so that a TMLE update based on a log-likelihood loss and local least favorable submodel (i.e., with score Q̄_n(1, W) − Ψ(Q_n)) will not change this estimator. Let Q_n* = (Q_{W,n}, Q̄_n*). The HAL-TMLE of ψ_0 is the plug-in estimator ψ_n* ≡ Ψ(Q_n*) = (1/n) Σ_{i=1}^n Q̄_n*(1, W_i).
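A minimal sketch of this update step (hypothetical inputs QbarAW, Qbar1W, g1W standing for the fitted Q̄_n(A, W), Q̄_n(1, W), G_n(W); statsmodels is used only as a stand-in for any univariate logistic regression with an offset):

```python
import numpy as np
import statsmodels.api as sm

def tmle_update_treatment_specific_mean(A, Y, QbarAW, Qbar1W, g1W):
    """One-step TMLE update for psi = E_W Qbar(1, W) via the logistic fluctuation (11)."""
    logit = lambda p: np.log(p / (1 - p))
    expit = lambda x: 1 / (1 + np.exp(-x))
    clever = A / g1W                 # clever covariate C(G_n)(A, W) = A / G_n(W)
    offset = logit(QbarAW)           # offset logit Qbar_n(A, W)
    # Univariate logistic regression MLE for epsilon_n with a fixed offset.
    eps = sm.GLM(Y, clever.reshape(-1, 1), family=sm.families.Binomial(),
                 offset=offset).fit().params[0]
    Qbar1W_star = expit(logit(Qbar1W) + eps / g1W)   # fluctuated fit evaluated at A = 1
    return np.mean(Qbar1W_star)      # plug-in HAL-TMLE psi_n^*
```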

2.4 Asymptotic efficiency theorem for HAL-TMLE and CV-HAL-TMLE

Lemma 1 establishes that d_01(Q_n, Q_0) and d_02(G_n, G_0) are O_P(n^{-1/2-α/4}). Lemma 2 in Appendix A proves that also d_01(Q_n*, Q_0) = O_P(n^{-1/2-α/4}). Combined with (8), this shows that the second-order term R_20(Q_n*, G_n, Q_0, G_0) = O_P(n^{-1/2-α/4}).

We have the following identity for the HAL-TMLE:

(12) Ψ(Q_n*) − Ψ(Q_0) = (P_n − P_0) D*(Q_n*, G_n) + R_20(Q_n*, G_n, Q_0, G_0) + r_n

(13) = (P_n − P_0) D*(Q_0, G_0) + (P_n − P_0){D*(Q_n*, G_n) − D*(Q_0, G_0)} + R_20(Q_n*, G_n, Q_0, G_0) + r_n.

The second term on the right-hand side is O_P(n^{-1/2-α/4}), following the same empirical process theory proof as Theorem 1 in van der Laan [17], using the continuity condition (9) on D*. Thus, this proves the following asymptotic efficiency theorem.

Theorem 1

Consider the statistical model M and target parameter Ψ : M → IR satisfying (2), (4), (8), (9). Let Q_n, G_n be the above defined HAL-MLEs, where d_01(Q_n, Q_0) and d_02(G_n, G_0) are O_P(n^{-1/2-α/4}). Let Q_n* = Q_{n,ε_n} be the one-step TMLE update according to a submodel {Q_{n,ε} : ε} ⊂ M solving the efficient influence curve equation such that (10) holds.

Then the HAL-TMLE Ψ ( Q n * ) of ψ 0 is asymptotically efficient

(14) Ψ(Q_n*) − Ψ(Q_0) = P_n D*(Q_0, G_0) + O_P(n^{-1/2-α/4}).

We remind the reader that the condition (4), stating that the loss functions and canonical gradient are contained in the class of cadlag functions with a universal bound on the sectional variation norm, can be replaced by a general Donsker class condition (3). We also remark that Theorem 1 trivially generalizes to any rate of convergence for d_01(Q_n, Q_0) and d_02(G_n, G_0) by simply setting the remainder term in (14) equal to this same rate. Due to a recent result in [18] for the HAL-MLE, utilizing an improved recently published covering number bound, under specified conditions we now even have d_01(Q_n, Q_0) = O_P(n^{-2/3} (log n)^d) and d_02(G_n, G_0) = O_P(n^{-2/3} (log n)^d). So, applying this result would yield (14) with remainder O_P(n^{-2/3} (log n)^d).

2.4.1 Wald type confidence interval

A first order asymptotic 0.95-level confidence interval is given by ψ_n* ± 1.96 σ_n / n^{1/2}, where σ_n^2 = P_n {D*(Q_n*, G_n)}^2 is a consistent estimator of σ_0^2 = P_0 {D*(Q_0, G_0)}^2. Clearly, this first order confidence interval ignores the exact remainder R̃_{2n} in the exact expansion Ψ(Q_n*) − Ψ(Q_0) = (P_n − P_0) D*(Q_0, G_0) + R̃_{2n} as presented in (13):

(15) R̃_{2n} ≡ R_20(Q_n*, G_n, Q_0, G_0) + (P_n − P_0){D*(Q_n*, G_n) − D*(Q_0, G_0)} + r_n.

Let’s consider the extra model structure (6). The asymptotic efficiency proof above of the HAL-TMLE ( C u ) relies on the HAL-MLEs ( Q n , C Q u , G n , C G u ) converging to the true ( Q 0 , G 0 ) at rate faster than n 1 / 4 , and their sectional variation norm being uniformly bounded from above by C u = ( C Q u , C G u ) . Both of these conditions are still known to hold for the CV-HAL-MLE ( Q n , C Q n , G n , C G n ) in which the constants ( C Q , C G ) are selected with the cross-validation selector C n = ( C Q n , C G n ) [7]. This follows since the cross-validation selector is asymptotically equivalent to the oracle selector, thereby guaranteeing that C n will exceed the sectional variation norm of the true ( Q 0 , G 0 ) with probability tending to 1. Typically, one will only data adaptively select C u , while keeping C l = ( C Q l , C G l ) at its known lower bound. Therefore, we have that this CV-HAL-TMLE is also asymptotically efficient. Of course, this CV-HAL-TMLE is more practical and powerful than the HAL-TMLE at an apriori specified C = ( C Q , C G ) = ( C Q u , C Q l , C G u , C G l ) since it adapts the choice of bounds C = ( C Q , C G ) to the true sectional variation norms C 0 = ( C Q 0 , C G 0 ) for ( Q 0 , G 0 ) .
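As an illustration of this cross-validation selector in the lasso formulation (a sketch, reusing the HAL design matrix H from the fit_hal_logistic sketch above), scikit-learn's LogisticRegressionCV searches a grid of inverse-penalty values, each corresponding with a sectional variation norm bound, and picks the cross-validated risk minimizer:

```python
from sklearn.linear_model import LogisticRegressionCV

def cv_hal_logistic(H, y, Cs=10, folds=5):
    """Cross-validation selector over a grid of L1 bounds for the HAL logistic fit."""
    return LogisticRegressionCV(Cs=Cs, penalty="l1", solver="liblinear",
                                scoring="neg_log_loss", cv=folds).fit(H, y)
```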

For simplicity, in the next theorem we focus on data adaptive selection of C u only.

Theorem 2

Consider the setting of Theorem 1, but with the extra model structure (6). Let C^u_{Q0} = ∥Q_0∥_v* and C^u_{G0} = ∥G_0∥_v*. Suppose that the C^u_Q and C^u_G that define the HAL-MLEs Q_n = Q_{n,C^u_Q} and G_n = G_{n,C^u_G} are replaced by data adaptive selectors C^u_{Qn} and C^u_{Gn} for which

(16) P_0(C^u_{Q0} ≤ C^u_{Qn} ≤ C^u_Q, C^u_{G0} ≤ C^u_{Gn} ≤ C^u_G) → 1, as n → ∞.

Then, under the same assumptions as in Theorem 1 , the TMLE Ψ ( Q n * ) , using Q n = Q n , C Q n u and G n = G n , C G n u as initial estimators, is asymptotically efficient.

In general, when the model M = M ( C ) is defined by global constraints C, then one should use cross-validation to select these constraints C, which will only improve the performance of the initial estimators and corresponding TMLE, due to its asymptotic equivalence with the oracle selector. So our model M satisfying (4) and the extra structure (6) might have more global constraints beyond C u = ( C Q u , C G u ) and these could then also be selected with cross-validation resulting in a CV-HAL-MLE and corresponding HAL-TMLE (see also our two examples).

3 The nonparametric bootstrap for the HAL-TMLE

Let O_1#, …, O_n# be n i.i.d. draws from the empirical measure P_n. Let P_n# be the empirical measure of this bootstrap sample.

3.1 Definition of bootstrapped HAL-MLEs for model with extra structure (6)

In this subsection, we will assume the extra structure (6), so that our parameter spaces for Q and G consist of cadlag functions with a universal bound C^u on the sectional variation norm, thereby allowing computationally friendly definitions of the bootstrapped HAL-MLEs. We generalize the definition of Q being absolutely continuous w.r.t. Q_n: Q ≪ Q_n.

Definition 1

Recall the representation (5) for a multivariate real valued cadlag function F in terms of its sections F s . Assume the extra model structure (6) on M . We will say that Q k is absolutely continuous w.r.t. Q k , n if for each subset s { 1 , , m 1 k } , its s-specific section Q k , s defined by u s Q k ( u s , 0 s ) is absolutely continuous w.r.t. Q n , k , s defined by u s Q n , k ( u s , 0 s ) . We use the notation Q k Q n , k . In addition, we use the notation Q Q n if Q k Q n , k for each component k { 1 , , K 1 } . Similarly, we use this notation G G n if G k G n , k for each component k { 1 , , K 2 } .

In practice, the HAL-MLE Q n = arg  min Q Q ( M ) P n L 1 ( Q ) is attained (or simply defined as a minimum among all discrete measures with fine enough selected support) by a discrete measure Q n so that it can be computed by minimizing the empirical risk over a large linear combination of indicator basis functions (e.g., 2 m 1 k n for Q n k ) under the constraint that the sum of the absolute value of the coefficients is bounded by the specified constant C Q [8]. In that case, the constraint Q Q n states that Q is a linear combination of the indicator basis functions that had a non-zero coefficient in Q n .

Let

Q_n# = arg min_{Q ∈ Q(M), Q ≪ Q_n, ∥Q∥_v* ≤ ∥Q_n∥_v*} P_n# L_1(Q),

G_n# = arg min_{G ∈ G(M), G ≪ G_n, ∥G∥_v* ≤ ∥G_n∥_v*} P_n# L_2(G)

be the corresponding HAL-MLEs, based on the bootstrap sample, of Q_n = arg min_{Q ∈ Q(M)} P_n L_1(Q) and G_n = arg min_{G ∈ G(M)} P_n L_2(G). Here P_n# is the empirical probability measure that puts mass 1/n on each observation O_i# from a bootstrap sample O_1#, …, O_n# representing n i.i.d. draws from P_n. Since our results on the rates of convergence of Q_n# and G_n# to Q_n and G_n only rely on P_n# L_1(Q_n#) ≤ P_n# L_1(Q_n) (and similarly for G_n#), the additional restrictions Q ≪ Q_n and ∥Q∥_v* ≤ ∥Q_n∥_v* are theoretically appropriate. In addition, the extra restriction Q ≪ Q_n makes the computation of the HAL-MLE on the bootstrap sample much faster than that of the HAL-MLE Q_n based on the original sample, so that enforcing this extra constraint is only beneficial from a computational point of view. That is, the computation of Q_n# only involves minimizing the empirical risk w.r.t. P_n# over the coefficients that were non-zero in the Q_n-fit. Given our experience that a typical HAL-MLE fit has around n non-zero coefficients, this makes the calculation of Q_n# across many bootstrap samples computationally feasible. Additionally, in Appendix F, Figure 6, we include two empirical simulation results on the number of non-zero coefficients as a function of sample size.
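A sketch of this restricted bootstrap refit (again in the scikit-learn formulation of the earlier sketches; nonzero_idx is the support of the original HAL fit, e.g., np.flatnonzero(fit.coef_[0]), and C its fixed L_1 bound):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_hal_refit(H, y, nonzero_idx, C, rng):
    """Refit the HAL lasso on a nonparametric bootstrap sample, restricted to the
    columns (basis functions) that had non-zero coefficients in the original fit."""
    n = len(y)
    idx = rng.integers(0, n, size=n)                # i.i.d. draws from P_n
    H_boot, y_boot = H[idx][:, nonzero_idx], y[idx]
    return LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(H_boot, y_boot)
```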

The above bootstrap distribution depends on the bounds C = ( C Q , C G ) enforced in the HAL-MLEs ( Q n , G n ) . One possible choice is to set C = ( C Q , C G ) equal to the cross-validation selector C n , c v , where we typically only adaptively select the upper bound C u so that C n , c v = ( C n , c v u , C l ) . In the next section we discuss an alternative (so called plateau) selector C n u for C u that aims to improve finite sample coverage. Either way, in the bootstrap distribution the choice C ( = C n , c v or = C n ) is treated as fixed, although we will evaluate the bootstrap distribution for a range of C values to determine the plateau selector C n u .

3.2 Definition of bootstrapped HAL-MLE in general

In general, Q_n# ∈ Q(M) and G_n# ∈ G(M) have to be defined as estimators of Q_n and G_n based on the bootstrap sample P_n#, satisfying that, with probability tending to 1 (conditional on P_n), P_n# L_1(Q_n#) ≤ P_n# L_1(Q_n) and P_n# L_2(G_n#) ≤ P_n# L_2(G_n). For example, Q_n# = arg min_{Q ∈ Q(M)} P_n# L_1(Q) and G_n# = arg min_{G ∈ G(M)} P_n# L_2(G), but one is allowed to add restrictions to the parameter space over which one minimizes Q → P_n# L_1(Q) as long as this space still includes Q_n with probability tending to 1 (and similarly for G_n#).

3.3 Bootstrapped HAL-TMLEs

Let ϵ n # = arg min ϵ P n # L 1 ( Q n , ϵ # ) be the one-step TMLE update of Q n # based on the least favorable submodel { Q n , ϵ # : ϵ } through Q n # at ϵ = 0 with score D * ( Q n # , G n # ) at ϵ = 0 . Let Q n # * = Q n , ϵ n # # be the TMLE update which is assumed to solve

(17) r_n# ≡ |P_n# D*(Q_n#*, G_n#)| = o_{P_n}(n^{-1/2}),

conditional on (P_n : n ≥ 1) (just like r_n = o_P(n^{-1/2})). Let Ψ(Q_n#*) be the resulting TMLE of Ψ(Q_n*) based on this nonparametric bootstrap sample. Let σ_n^2 be an estimate of the asymptotic variance σ_0^2 = P_0 D*(Q_0, G_0)^2, such as σ_n^2 = P_n D*(Q_n, G_n)^2. Let σ_n#^2 be this estimator applied to P_n#. We estimate the finite sample distribution of n^{1/2}(Ψ(Q_n*) − Ψ(Q_0))/σ_n with the sampling distribution of Z_n^{1,#} ≡ n^{1/2}(Ψ(Q_n#*) − Ψ(Q_n*))/σ_n#, conditional on P_n. Let Φ_n#(x) = P(n^{1/2}(Ψ(Q_n#*) − Ψ(Q_n*))/σ_n# ≤ x | P_n) be the cumulative distribution function of this bootstrap sampling distribution. So a bootstrap based 0.95-level confidence interval for ψ_0 is given by

[ψ_n* + q#_{0.025,n} σ_n / n^{1/2}, ψ_n* + q#_{0.975,n} σ_n / n^{1/2}],

where q#_{p,n} = Φ_n^{#,−1}(p) is the p-th quantile of this bootstrap distribution. We note that the upper bounds ∥Q_n∥_v* and ∥G_n∥_v* on the sectional variation norms of Q_n# and G_n#, or equivalently, the upper bounds C^u_{Qn} and C^u_{Gn} in the definition of the HAL-MLEs Q_n and G_n, will impact the values of these quantiles q#_{0.025,n} and q#_{0.975,n}. That is, the larger these values, the larger the finite dimensional models for Q_n and G_n implied by their non-zero coefficients, and thereby the larger the variation of the resulting TMLE Ψ(Q_n#*). Our results apply for any data adaptive selector C_n satisfying that, with probability tending to 1, C^u_{Qn} is larger than ∥Q_0∥_v* and smaller than C^u_Q, and similarly for C^u_{Gn}. However, clearly, the finite sample coverage of the resulting bootstrap confidence interval is affected by the precise choice C^u_n = (C^u_{Qn}, C^u_{Gn}).
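Schematically, once B bootstrap replicates Ψ(Q_n#*) and the matching σ_n# values are available (hypothetical array inputs psi_boot and sigma_boot), the interval above is formed from the quantiles of the standardized bootstrap replicates:

```python
import numpy as np

def bootstrap_ci(psi_star, psi_boot, sigma_boot, sigma_n, n, level=0.95):
    """Quantile-based bootstrap CI for psi_0 from the standardized bootstrap TMLE Z_n^{1,#}."""
    z = np.sqrt(n) * (psi_boot - psi_star) / sigma_boot
    lo, hi = np.quantile(z, [(1 - level) / 2, 1 - (1 - level) / 2])
    return psi_star + lo * sigma_n / np.sqrt(n), psi_star + hi * sigma_n / np.sqrt(n)
```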

We now want to prove that n 1 / 2 ( Ψ ( Q n # * ) Ψ ( Q n * ) ) , conditional on P n , converges in distribution to N ( 0 , σ 0 2 ) , and thereby also that Φ n # converges to the cumulative distribution function of limit distribution N ( 0 , 1 ) . Importantly, this nonparametric bootstrap confidence interval could potentially dramatically improve the coverage relative to using the first order Wald-type confidence interval since this bootstrap distribution is estimating the variability of the full-expansion of the TMLE, including the exact remainder R ˜ 2 n .

In the next subsection we show that the nonparametric bootstrap works for the HAL-MLEs Q n and G n . Subsequently, not surprisingly, we can show that this also establishes that the bootstrap works for the one-step TMLE Q n * (K-th step TMLE for fixed K). This provides then the basis for proving that the nonparametric bootstrap is consistent for the HAL-TMLE.

3.4 Nonparametric bootstrap for HAL-MLE

The following theorem establishes that the bootstrap HAL-MLE Q n # estimates Q n as well, w.r.t. an empirical loss-based dissimilarity d n 1 ( Q n # , Q n ) = P n L 1 ( Q n # ) P n L 1 ( Q n ) , as Q n estimates Q 0 with respect to d 01 ( Q n , Q 0 ) = P 0 L 1 ( Q n ) P 0 L 1 ( Q 0 ) . In fact, we even have d 01 ( Q n # , Q 0 ) = O P ( n 1 / 2 α / 4 ) . The analog results apply to G n # .

Theorem 3

Assume (2) and (4).

Definitions: Let d_n1(Q, Q_n) = P_n{L_1(Q) − L_1(Q_n)} be the loss-based dissimilarity at the empirical measure, where Q_n is an HAL-MLE of Q_0 satisfying P_n L_1(Q_n) ≤ P_n L_1(Q_0). Similarly, let d_n2(G, G_n) = P_n{L_2(G) − L_2(G_n)} be the loss-based dissimilarity at the empirical measure, where G_n is an HAL-MLE of G_0 satisfying P_n L_2(G_n) ≤ P_n L_2(G_0).

Conclusion: Then,

d_n1(Q_n#, Q_n) = O_P(n^{-1/2-α/4})  and  d_n2(G_n#, G_n) = O_P(n^{-1/2-α/4}).

We also have

d_01(Q_n#, Q_0) = O_P(n^{-1/2-α/4})  and  d_02(G_n#, G_0) = O_P(n^{-1/2-α/4}).

Bootstrapping HAL-MLE ( C ) at C u = C n u for model with extra structure (6): This result also applies to the case that C u = ( C Q u , C G u ) in definition of HAL-MLEs ( Q n , G n ) is replaced by a data adaptive choice C n u satisfying (16) (which is fixed under the bootstrap distribution).

The proof of Theorem 3 is presented in Appendix B. In Appendix B we first establish that d_n1(Q_n#, Q_n) = O_P(n^{-1/2-α/4}), and we use that, in combination with d_01(Q_n, Q_0) = O_P(n^{-1/2-α/4}), this also implies d_01(Q_n#, Q_0) = O_P(n^{-1/2-α/4}). Thus, clearly, d_n1(Q_n#, Q_n) is an equally powerful dissimilarity as d_01(·). In fact, assuming that M has the extra model structure (6), in Appendix D we also explicitly show that d_n1(Q_n#, Q_n) dominates a specified quadratic dissimilarity.

Note that if C u = C n u , then conditional on P n , C n u is still fixed, so that establishing the last result in Theorem 3 only requires checking that the proof of the stated convergence of the bootstrapped HAL-MLE ( Q n , C Q u # , G n , C G u # ) to the HAL-MLE ( Q n , C Q u , G n , C G u ) at a fixed C u = ( C Q u , C G u ) w.r.t. the loss-based dissimilarities d n 1 and d n 2 holds uniformly in C u between the true sectional variation norms C 0 u and the model upper bound C u . The validity of this result does not even rely on C n u exceeding C 0 u , but the latter is needed for establishing that the HAL-MLE Q n , C n u is consistent for Q 0 and thus the efficiency of the HAL-TMLE Ψ ( Q n * ) .

3.5 Preservation of rate of convergence for the targeted bootstrap estimator

In Appendix C we prove that d_01(Q_n#*, Q_0) = O_P(n^{-1/2-α/4}), under the same conditions as assumed in our general Theorem 4.

3.6 The nonparametric bootstrap for the HAL-TMLE

We can now imitate the efficiency proof for the HAL-TMLE to obtain the desired result for the bootstrapped HAL-TMLE of Ψ ( Q n * ) . By Theorem 3, under the assumptions of Theorem 1 for asymptotic efficiency of the TMLE, we have that all five terms d n 1 ( Q n # , Q n ) , d 01 ( Q n # , Q 0 ) , d 01 ( Q n # * , Q 0 ) , d n 2 ( G n # , G n ) , d 02 ( G n # , G 0 ) are O P ( n 1 / 2 α / 4 ) . For a model with extra structure (6), we consider the bootstrap for a data adaptive selector C n u = ( C Q n u , C G n u ) satisfying (16). A general model M might also be indexed by a universal bound C for some quantity C ( P ) for any P M , which could then also be data adaptively selected as long as it satisfies (16) with C 0 = C ( P 0 ) .

Theorem 4

Assumptions: Consider the statistical model M and target parameter Ψ : M → IR satisfying (2), (4), (8), (9). Consider the above defined HAL-MLEs Q_n, G_n satisfying, with probability tending to 1, P_n L_1(Q_n) ≤ P_n L_1(Q_0) and P_n L_2(G_n) ≤ P_n L_2(G_0). Consider also the above defined bootstrapped HAL-MLEs Q_n#, G_n# satisfying, with probability tending to 1, conditional on (P_n : n ≥ 1), P_n# L_1(Q_n#) ≤ P_n# L_1(Q_n) and P_n# L_2(G_n#) ≤ P_n# L_2(G_n). Consider the HAL-TMLE Q_n#* = Q#_{n,ε_n#} and assume (17): r_n# = |P_n# D*(Q_n#*, G_n#)| = o_P(n^{-1/2}).

TMLE is efficient: The standardized TMLE is asymptotically efficient: Z_n^1 ≡ n^{1/2}(Ψ(Q_n*) − Ψ(Q_0)) →_d N(0, σ_0^2), where σ_0^2 = P_0 D*(Q_0, G_0)^2.

Bootstrapped HAL-MLE: d_01(Q_n#, Q_n) = O_P(n^{-1/2-α/4}), d_02(G_n#, G_0) = O_P(n^{-1/2-α/4}) and d_01(Q_n#*, Q_0) = O_P(n^{-1/2-α/4}).

Bootstrapped HAL-TMLE: Conditional on ( P n : n 1 ) , the bootstrapped TMLE is asymptotically linear:

Ψ(Q_n#*) − Ψ(Q_n) = (P_n# − P_n) D*(Q_n, G_n) + O_P(n^{-1/2-α/4}).

As a consequence, conditional on (P_n : n ≥ 1), the standardized bootstrapped TMLE converges to N(0, σ_0^2): Z_n^{1,#} ≡ n^{1/2}(Ψ(Q_n#*) − Ψ(Q_n*)) →_d N(0, σ_0^2).

Consistency of the nonparametric bootstrap for HAL-TMLE at data adaptive selector C n u : Assume the extra model structure (6) on M , and its corresponding definitions of the HAL-MLEs indexed by sectional variation norm bounds C = ( C u , C l ) . This theorem can be applied to the bootstrap distribution at a data adaptive C n = ( C n u , C n l ) satisfying (16).

The proof of this theorem is presented in Appendix D.

4 Finite sample modifications of the nonparametric bootstrap distribution for model with extra structure (6)

In this section we focus on the case that the model M satisfies the extra structure (6). The finite sample modifications proposed here are evaluated in our simulation study in Section 6 for our two examples. The nuisance parameter estimates Q_n and G_n are key inputs of the HAL-TMLE bootstrap. The HAL estimators of these nuisance parameters depend largely on the selection of the upper bound C^u = (C^u_Q, C^u_G) of the sectional variation norm. We will focus on a data adaptive selector C^u_{Qn} (replacing C^u_Q), for a given selector C^u_{Gn}, where the latter is chosen to be the cross-validation selector. Since our target parameter is a function of Q only, we suggest that the selection of C^u_{Qn} is fundamentally more important than that of C^u_{Gn}, and it also creates enough room for our desired finite sample adjustment of the nonparametric bootstrap. In software implementations of the LASSO, the L_1-norm constraint C^u_Q is translated into a penalized empirical risk with an L_1-penalty hyper-parameter λ, where a choice of C^u_Q corresponds with a unique choice of λ. In the sequel, we will propose a selector of λ, and thereby of C^u_Q.

Ideally, we want to set C^u_Q = C^u_{Q0}, equal to the sectional variation norm of Q_0, so that the bootstrap model for the HAL-MLE Q_n# is large enough for unbiased estimation of Q_n. Due to the asymptotic equivalence of the cross-validation selector C^u_{Qn,CV} with the oracle selector that optimizes the loss-based dissimilarity, the cross-validation selector C^u_{Qn,CV} will approximate C^u_{Q0} as sample size increases. However, in finite samples, when the true sectional variation norm C^u_{Q0} of Q_0 is large (λ_0 is small), the cross-validation selector C^u_{Qn,CV} will tend to be smaller than the oracle value C^u_{Q0} (λ_CV > λ_0). That is, C^u_{Qn,CV} optimally trades off bias and variance for estimation of Q_0, but fixing C^u_Q at this choice C^u_{Qn,CV} might oversimplify the complexity of the target Q_n* of the bootstrap distribution, and thereby cause the bootstrap to under-estimate the variability of the true sampling distribution of the TMLE. As a result, the bootstrap confidence interval will potentially still be anti-conservative.

Since the oracle choice λ_0 is unknown, we propose to estimate λ_0 with a plateau selection method. Consider a pre-specified ordered (from large to small) sequence of lambda candidates Λ = (λ_1, λ_2, …, λ_J) with corresponding HAL-MLEs Q_{n,λ_j} and HAL-TMLEs Q*_{n,λ_j}, j = 1, …, J. We set λ_1 = λ_{n,CV}, so that we only consider sectional variation norm constraints larger than the cross-validation selector C^u_{Qn,CV}. The sectional variation norm of Q_{n,λ_j} will thus be increasing in j. For each λ_j we compute the width w_j = (q#_{0.975,n,λ_j} − q#_{0.025,n,λ_j}) σ_n / n^{1/2} of the nonparametric bootstrap confidence interval based on bootstrapping the standardized TMLE n^{1/2}(Ψ(Q*_{n,λ_j}) − Ψ(Q_0))/σ_n, given by [Ψ(Q_n*) + q#_{0.025,n,λ_j} σ_n / n^{1/2}, Ψ(Q_n*) + q#_{0.975,n,λ_j} σ_n / n^{1/2}], j = 1, …, J. The interval widths increase monotonically and should generally show deceleration around λ_0, where the curve moves towards a plateau, and eventually it might become erratic. A related theoretical reference for this phenomenon is Theorem 1 in Davies and van der Laan [19], which proposes such a plateau selector and proves the consistency of the corresponding plateau variance estimator in a growing model that eventually captures the true data distribution (even though, in practice, plateaus appear in much greater generality). As the model grows, the variance estimator keeps increasing, but once the model contains the true distribution the variance estimator is consistent/unbiased. Therefore, for large sample sizes we should see the same or similar variance estimate, but once the model gets too big relative to sample size (defined in Theorem 1 in Davies and van der Laan [19]), it becomes too variable and erratic. In our methodology, the widths of the confidence intervals correspond with variance/uncertainty estimation, and our models grow due to the increase in sectional variation norm; indeed, as the sectional variation norm passes the variation norm of the true function, the growing model captures the truth.

So a similar intuition holds for our estimator. If we set the variation norm C Q u smaller than the true C Q 0 u , and let n go to infinity, we are inconsistent (negatively biased) for the true variance. If we set C Q u > C Q 0 u , any TMLE is efficient and will thus have the same asymptotic variance. In between, as we increase C Q u towards C Q 0 u , the width of the confidence intervals grows accordingly. Therefore, for large sample size we expect the width curve to increase as C Q u moves towards C Q 0 u and to become flat once C Q u > C Q 0 u . Through numerical simulations, we indeed observed that λ 0 is near where the plateau begins. It remains to decide on a method for determining the location of the start of this deceleration. A variety of methods could be proposed here. In our concrete implementation, demonstrated in our simulation study, we compute the location of the start of the plateau as the location at which the second derivative is maximized, where we use the log λ -scale (due to λ having very small values). Specifically, λ p l a t e a u ≡ λ j , where

(18) $j = \operatorname*{arg\,max}_{j = 2, \ldots, J-1} \frac{(w_{j+1} - w_j) - (w_j - w_{j-1})}{\left(\log(\lambda_{j+1}) - \log(\lambda_j)\right)\left(\log(\lambda_j) - \log(\lambda_{j-1})\right)}$

We choose a log-uniform grid of pre-specified λ values to simplify the finite difference estimation of the derivative, and we leave it as important future work to implement a potentially better estimator allowing a more flexible choice of the λ grid.
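As a minimal sketch in R (assuming the interval widths w j have already been computed over a log-uniform λ grid ordered from large to small; the function name is illustrative), the selector (18) amounts to:

```r
# Plateau selector (18): maximize the discrete second derivative of the
# interval-width curve on the log-lambda scale.
select_lambda_plateau <- function(lambdas, widths) {
  stopifnot(length(lambdas) == length(widths), length(lambdas) >= 3)
  log_l <- log(lambdas)
  J <- length(lambdas)
  second_deriv <- sapply(2:(J - 1), function(j) {
    ((widths[j + 1] - widths[j]) - (widths[j] - widths[j - 1])) /
      ((log_l[j + 1] - log_l[j]) * (log_l[j] - log_l[j - 1]))
  })
  j_star <- which.max(second_deriv) + 1  # shift back to the index of the full grid
  lambdas[j_star]
}
```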

Figure 1 illustrates a simulated example of the curve log ( λ ) ↦ w ( λ ) . As the value of λ decreases, starting at λ C V , we observe a slow increase initially (an almost flat area around λ C V ), then an accelerated increase, until the curve starts reaching its plateau right after λ 0 . Our method looks for the numerical maximum of the (discrete) second-order derivative (18), where the function starts moving towards the plateau. Another method might be to look for the actual start of the plateau, but our concern is that this might correspond to a plateau caused purely by overfitting the data (where the finite sample only allows so much overfitting).

Figure 1:

(A) A simulated example of Wald-type interval width as a function of λ. (B) The first and second order derivatives of the same curve. Vertical lines indicate λ 0 , λ C V and λ p l a t e a u .

Increasing the scaling σ n -factor by taking into account bias of bootstrap sampling distribution: Another modification we propose concerns the bias of the bootstrap distribution. We assume that we used the above method for selecting a λ n = λ p l a t e a u . We will use as point estimate Ψ ( Q n * ) , where Q n * = Q n , λ n , C V * , i.e., the TMLE using the cross-validated HAL-MLE. So the role of the bootstrap is to determine a confidence interval around this point estimate. Our confidence interval will be of the form [ Ψ ( Q n * ) + q n , 0.025 # σ n # / n 1 / 2 , Ψ ( Q n * ) + q n , 0.975 # σ n # / n 1 / 2 ] , where we use the nonparametric bootstrap at fixed sectional variation norm implied by λ n , but centered to have mean zero, to obtain these two quantiles. The bias in the bootstrap distribution will instead be incorporated in σ n # by defining σ n # 2 as the MSE of the bootstrap realizations Ψ ( Q n , i # * ) relative to Ψ ( Q n * ) , i = 1 , , N , where N is the number of bootstrap samples drawn from P n .

The motivation is that in general the nonparametric bootstrap will also inherit bias of the sampling distribution of n 1 / 2 ( Ψ ( Q n * ) − Ψ ( Q 0 ) ) / σ n . For example, if there is finite sample bias of Ψ ( Q n * ) that is hurting the coverage of a Wald-type confidence interval, the bootstrap distribution (i.e., its quantiles) will likely be biased further in the same direction. We choose not to estimate the bias with the bootstrap and compensate by shifting the bootstrap distribution accordingly, since estimates of bias are typically unreliable. Instead, we widen the bootstrap confidence interval by replacing the scaling factor σ n by the square root of the MSE of Ψ ( Q n # * ) w.r.t. Ψ ( Q n * ) . Specifically, the “RMSE-scaled bootstrap” takes the form

(19) [ Ψ ( Q n * ) + σ n # q n , 0.025 # / n 1 / 2 , Ψ ( Q n * ) + σ n # q n , 0.975 # / n 1 / 2 ] ,

where (using short-hand notation)

$\sigma_n^{\#} \equiv \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\Psi_{i,n}^{\#*} - \Psi(Q_n^*)\right)^2} = \sqrt{\mathrm{bias}(\Psi_{i,n}^{\#*})^2 + \mathrm{stddev}(\Psi_{i,n}^{\#*})^2}$

is the estimated RMSE of the bootstrap estimator Ψ i , n # * = Ψ ( Q n , i # * ) , and q n , α # is the α-quantile of the bootstrap distribution of standardized Z i , n # = n 1 / 2 ( Ψ i , n # * 1 N i = 1 N Ψ i , n # * ) / stddev ( Ψ i , n # * ) .
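A minimal R sketch of the RMSE-scaled interval (19); here psi_boot denotes the N bootstrap estimates Ψ ( Q n # * ) at the selected λ, psi_hat the point estimate Ψ ( Q n * ) , and n the sample size (all names are illustrative):

```r
rmse_scaled_ci <- function(psi_boot, psi_hat, n, level = 0.95) {
  alpha <- (1 - level) / 2
  # sigma_n^# : root of the MSE of the bootstrap estimates around the point estimate
  sigma_rmse <- sqrt(mean((psi_boot - psi_hat)^2))
  # centered and standardized bootstrap distribution, used only for its quantiles
  z_boot <- sqrt(n) * (psi_boot - mean(psi_boot)) / sd(psi_boot)
  q <- quantile(z_boot, probs = c(alpha, 1 - alpha), names = FALSE)
  c(lower = psi_hat + sigma_rmse * q[1] / sqrt(n),
    upper = psi_hat + sigma_rmse * q[2] / sqrt(n))
}
```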

The full modified HAL-TMLE bootstrap procedure we propose in this article can be summarized in the following pseudo-algorithm:
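Since the individual steps were described above, the following is only a minimal R sketch tying together the two helpers sketched earlier; fit_hal_tmle and bootstrap_hal_tmle are hypothetical placeholders for fitting the HAL-TMLE at a given λ (returning its estimate psi and influence-curve standard error se) and for recomputing it on nonparametric bootstrap samples at that λ:

```r
modified_hal_tmle_bootstrap <- function(data, lambda_grid, n_boot = 200) {
  # lambda_grid is ordered from large to small, with lambda_grid[1] = lambda_CV
  n <- nrow(data)
  fit_cv  <- fit_hal_tmle(data, lambda = lambda_grid[1])   # hypothetical helper
  sigma_n <- fit_cv$se * sqrt(n)                           # estimated sd of the influence curve
  widths <- numeric(length(lambda_grid))
  for (j in seq_along(lambda_grid)) {
    fit_j  <- fit_hal_tmle(data, lambda = lambda_grid[j])
    boot_j <- bootstrap_hal_tmle(data, lambda = lambda_grid[j], n_boot = n_boot)  # hypothetical
    z_j <- sqrt(n) * (boot_j - fit_j$psi) / sigma_n        # standardized bootstrap TMLE
    widths[j] <- diff(quantile(z_j, c(0.025, 0.975))) * sigma_n
  }
  lambda_plateau <- select_lambda_plateau(lambda_grid, widths)   # eq. (18)
  boot_plateau <- bootstrap_hal_tmle(data, lambda = lambda_plateau, n_boot = n_boot)
  # point estimate at lambda_CV, interval quantiles and RMSE scaling from the
  # bootstrap at lambda_plateau, as in (19)
  rmse_scaled_ci(boot_plateau, psi_hat = fit_cv$psi, n = n)
}
```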

5 Examples

In this section we apply our general theorem, by verifying its conditions, for asymptotic consistency of the nonparametric bootstrap of HAL-TMLE to two examples involving a nonparametric model. In the next section we will actually implement our nonparametric bootstrap based confidence intervals for these two examples, carry out a simulation study, and evaluate its practical performance w.r.t. finite sample coverage.

5.1 Nonparametric estimation of average treatment effect

Let O = ( W , A , Y ) ∼ P 0 , where W ∈ [ 0 , τ 1 ] ⊂ IR ≥ 0 m 1 is an m 1 -dimensional vector of baseline covariates, A ∈ { 0,1 } is a binary treatment, and Y ∈ { 0,1 } is a binary outcome. For a possible data distribution P, let Q̄ ( P ) = E P ( Y | A , W ) , Ḡ ( P ) = P ( A = 1 | W ) , and let Q W ( P ) be the cumulative probability distribution of W. Let Q 1 = Q W , Q 2 = logit Q̄ , Q = ( Q 1 , Q 2 ) , and G = logit Ḡ . Let g ( a | W ) = P ( A = a | W ) = Ḡ ( W ) a ( 1 − Ḡ ( W ) ) 1 − a . In addition, let m 11 = m 1 and m 12 = m 1 + 1 , in terms of our general notation. Suppose that our model assumes that Ḡ ( W ) depends on a possible subvector of W, and let m 2 be the dimension of this subvector.

Statistical model: Since Q 1 = Q W is a cumulative distribution function, it is a monotone m 1 -variate cadlag function and its sectional variation norm equals its total variation which thus equals 1. We assume that Q 2 is an element of the class of m 12 -dimensional cadlag functions with sectional variation norm bounded from above by some C Q 2 u . Here one can treat A as continuous on [ 0,1 ] and assume that Q 2 is a step-function in A with single jump at 1, allowing us to embed functions of continuous and discrete covariates in a cadlag function space. Similarly, we assume G is an element of the class of m 2 -dimensional cadlag functions with sectional variation norm bounded by a C G u . Let’s denote these parameter spaces for Q 1 , Q 2 and G with F 11 , F 12 and F 2 , respectively. Let F 1 = F 11 × F 12 be the parameter space of Q = ( Q 1 , Q 2 ) . For a given C Q u = ( C Q 1 u = 1 , C Q 2 u ) , C G u < , consider the statistical model

(20) M = { P : Q W ∈ F 11 , Q 2 ∈ F 12 , G ∈ F 2 }

Thus, M is defined as the set of all possible probability distributions for which the logits of the conditional means of Y and A are cadlag functions with sectional variation norm bounded by C Q u and C G u , respectively. Since logit Q̄ and logit Ḡ are bounded in supremum norm (implied by their bounds on the sectional variation norm), it follows that Q̄ and Ḡ are bounded from below by δ > 0 and from above by 1 − δ for some δ > 0 . We will refer to this bound δ = δ ( C Q u , C G u ) separately in our bounds below, even though it is implied by the sectional variation norm bound C u . In particular, this implies the strong positivity assumption min a ∈ { 0,1 } g ( a | W ) > δ > 0 , Q W -a.e.

Notice that indeed our parameter space for Q = ( Q 1 , Q 2 ) and of G is of type (6) or (7). Specifically, Q W is of the type (7) with C Q 1 l = C Q 1 u = 1 , while Q 2 and G have parameter spaces of type (6) with only an upper bound ( C Q 2 u , C G u ) on their sectional variation norm. This demonstrates that our model M is represented as the general model formulation defined in Section 2.

Target parameter: Let Ψ : M IR be defined by Ψ ( P ) = Ψ 1 ( P ) Ψ 0 ( P ) , where Ψ a ( P ) = E P E P ( Y | A = a , W ) . Note that Ψ ( P ) only depends on P through Q ( P ) = ( Q 1 , Q 2 ) , so that we will also use the notation Ψ ( Q ) instead of Ψ ( P ) . Let’s focus on Ψ 1 ( P ) which will also imply the formulas for Ψ 0 ( P ) and thereby Ψ ( P ) .

Loss functions for Q and G: Let L 11 ( Q W ) = x ( I ( W x ) Q W ( x ) ) 2 r ( x ) d x for some weight function r > 0 be the loss function for Q 10 = Q W , 0 . Let

d 01 ( Q W , Q W , 0 ) = P 0 L 11 ( Q W ) P 0 L 11 ( Q W , 0 )

be the corresponding loss-based dissimilarity. Let L 12 ( Q 2 ) = − { Y log Q̄ ( A , W ) + ( 1 − Y ) log ( 1 − Q̄ ( A , W ) ) } be the log-likelihood loss function for the conditional mean Q̄ 0 and thereby for Q 20 = logit Q̄ 0 . Let d 02 ( Q 2 , Q 20 ) = P 0 L 12 ( Q 2 ) − P 0 L 12 ( Q 20 ) be the corresponding Kullback–Leibler dissimilarity. We can then define the sum-loss L 1 ( Q ) = L 11 ( Q 1 ) + L 12 ( Q 2 ) for Q 0 = ( Q 10 , Q 20 ) , and its loss-based dissimilarity d 01 ( Q , Q 0 ) = P 0 L 1 ( Q ) − P 0 L 1 ( Q 0 ) which equals the sum of the following two dissimilarities:

$d_{01}(Q_1, Q_{10}) = \int_x \left(Q_W(x) - Q_{W,0}(x)\right)^2 r(x)\,dx, \qquad d_{02}(Q_2, Q_{20}) = \int \log\left[\left(\tfrac{\bar{Q}_0}{\bar{Q}}\right)^y \left(\tfrac{1-\bar{Q}_0}{1-\bar{Q}}\right)^{1-y}\right](a,w)\, dP_0(w,a,y),$

where Q̄ = Q̄ ( Q 2 ) and Q̄ 0 = Q̄ ( Q 20 ) are implied by Q 2 and Q 20 , respectively. Let L 2 ( G ) = − { A log Ḡ ( W ) + ( 1 − A ) log ( 1 − Ḡ ( W ) ) } be the loss function for G 0 = logit P 0 ( A = 1 | W ) , and let d 02 ( G , G 0 ) = P 0 L 2 ( G ) − P 0 L 2 ( G 0 ) be the Kullback-Leibler dissimilarity between G and G 0 .

Canonical gradient and corresponding exact second order expansion:

The canonical gradient of Ψ a at P is given by:

D a * ( Q , G ) = I ( A = a ) g ( A | W ) ( Y Q ( A , W ) ) + Q ( a , W ) Ψ a ( Q ) .

The exact second-order remainder R 20 a ( P , P 0 ) Ψ a ( P ) Ψ a ( P 0 ) + P 0 D a * ( P ) is given by:

R 20 a ( Q 2 , G , Q 20 , G 0 ) = ( g g 0 ) ( a | w ) g ( a | w ) ( Q Q 0 ) ( a , w ) d P 0 ( w ) .
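For concreteness, a minimal R sketch of the resulting efficient influence curve D * = D 1 * − D 0 * evaluated at fitted values (the function name and arguments are illustrative):

```r
# Efficient influence curve of the ATE at fitted values Qbar1 = Qbar(1, W),
# Qbar0 = Qbar(0, W) and propensity score g1 = g(1 | W).
ate_eif <- function(A, Y, Qbar1, Qbar0, g1) {
  psi <- mean(Qbar1 - Qbar0)                      # plug-in estimate using the empirical Q_W
  clever <- A / g1 - (1 - A) / (1 - g1)           # (2A - 1) / g(A | W)
  clever * (Y - ifelse(A == 1, Qbar1, Qbar0)) + (Qbar1 - Qbar0) - psi
}
# Wald-type standard error of the plug-in estimator: sd(ate_eif(...)) / sqrt(n)
```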

Bounding the second order remainder: By using Cauchy-Schwarz inequality, we obtain the following bound on R 20 a ( P , P 0 ) :

| R 20 a ( P , P 0 ) | δ 1 Q a Q a 0 P 0 G G 0 P 0 ,

where Q a ( W ) = Q ( a , W ) , a { 0 , 1 } . Thus, D * ( P ) = D 1 * ( P ) D 0 * ( P ) , R 20 ( P , P 0 ) = R 20 1 ( P , P 0 ) R 20 0 ( P , P 0 ) , and the upper bound for R 20 ( P , P 0 ) can be defined as the sum of the two upper bounds for R 20 a ( P , P 0 ) in the above inequality, a { 0 , 1 } .

By van der Vaart [16] we have | | p 1 / 2 p 0 1 / 2 | | P 0 2 P 0 log p 0 / p , where p and p 0 are densities of P and P 0 , with P 0 P . For Bernoulli distributions, we have | | p p 0 | | P 0 2 4 | | p 1 / 2 p 0 1 / 2 | | P 0 2 4 P 0 log p 0 / p . Following the same proof as in Lemma 4 of van der Laan [7], we note that p plays the role of p ( Y | A , W ) , so that d 0 ( p , p 0 ) is our KL dissimilarity d 02 ( Q , Q 0 ) . It now remains to show p p 0 2 = a , w y ( p p 0 ) 2 ( y | a , w ) d P 0 ( y | a , w ) d P 0 ( a , w ) , where y denotes the sum over y = 0 and y = 1 . Since Y is binary, p ( y = 0 | a , w ) = 1 p ( y = 1 | a , w ) = 1 Q ( a , w ) , and we obtain ( Q Q 0 ) 2 ( a , w ) d P 0 ( a , w ) 4 d 02 ( Q , Q 0 ) and thus | | Q a Q a 0 | | P 0 2 4 δ 1 d 02 ( Q , Q 0 ) . Therefore, we conclude that | | Q a Q a 0 | | P 0 2 δ 1 / 2 d 02 1 / 2 ( Q , Q 0 ) . Similarly, it follows that G G 0 P 0 2 d 02 1 / 2 ( G , G 0 ) . This thus shows the following bound on R 20 a ( P , P 0 ) :

| R 20 a ( P , P 0 ) | 2 δ 1.5 d 02 1 / 2 ( Q , Q 0 ) d 02 1 / 2 ( G , G 0 ) .

The right-hand side represents the function f ( d 01 1 / 2 ( Q , Q 0 ) , d 02 1 / 2 ( G , G 0 ) ) for the parameter Ψ a in our general notation: f ( x = ( x 1 , x 2 ) , y ) = 4 δ 1.5 x 2 y . The sum of these two bounds for a { 0,1 } (i.e., 2 f ( ) ) provides now a conservative bound for R 20 = R 20 1 R 20 0 :

(21) | R 20 ( P , P 0 ) | f ( d 02 1 / 2 ( Q , Q 0 ) , d 02 1 / 2 ( G , G 0 ) ) 8 δ 1.5 d 02 1 / 2 ( Q , Q 0 ) d 02 1 / 2 ( G , G 0 ) .

This verifies (8). We note that this bound is very conservative due to the arguments we provided in general in the previous section for double robust estimation problems.

Continuity of canonical gradient: Regarding the continuity assumption (9), we note that P 0 { D a * ( P ) D a * ( P 0 ) } 2 can be bounded by | | G G 0 | | P 0 2 + | | Q a Q a 0 | | P 0 2 and ( Ψ a ( Q ) Ψ a ( Q 0 ) ) 2 , where the constant depends on δ. The latter square difference can be bounded in terms of | | Q a Q a 0 | | P 0 2 and, by applying our integration by parts formula to Q a ( w ) d ( Q W Q W 0 ) ( w ) , by d 01 ( Q W , Q W 0 ) , where the multiplicative constant depends on C Q u . We conclude that P 0 { D a * ( P ) D a * ( P 0 ) } 2 is bounded in terms of d 01 ( Q , Q 0 ) + d 02 ( G , G 0 ) . Thus this proves (9) for D * = D 1 * D 0 * .

Uniform model bounds on sectional variation norm: It also follows immediately that the sectional variation norm model bounds M 1 , M 2 , M 3 (4) of L 1 ( Q ) , L 2 ( G ) and D * ( P ) are all finite, and can be expressed in terms of ( C Q u , C G u , δ ) . This verifies the model assumptions of Section 2.

HAL-MLEs: Let Q n = arg min Q ∈ F 1 P n L 1 ( Q ) and G n = arg min G ∈ F 2 P n L 2 ( G ) be the HAL-MLEs. As shown in [7], [8], Q n and G n can be computed with standard LASSO logistic regression software using a linear logistic regression model with around n 2 m 1 indicator basis functions, where m 1 is the dimension of W.
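A minimal R sketch of this computation, assuming the glmnet package; the basis construction and function names are illustrative, and dedicated HAL implementations will be far more efficient:

```r
library(glmnet)

# Indicator (zero-order spline) basis: for every nonempty subset s of the
# columns of X and every observation j, the basis function is
# x -> I(x_s >= X[j, s] componentwise), giving roughly n * 2^d columns.
make_hal_basis <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)
  subsets <- unlist(lapply(1:d, function(k) combn(d, k, simplify = FALSE)),
                    recursive = FALSE)
  cols <- lapply(subsets, function(s) {
    sapply(1:n, function(j) {
      as.numeric(apply(X[, s, drop = FALSE], 1, function(x) all(x >= X[j, s])))
    })
  })
  do.call(cbind, cols)
}

# L1-penalized logistic regression on the basis; the penalty lambda plays the
# role of the sectional variation norm bound (smaller lambda <-> larger C^u).
fit_hal_logistic <- function(X, y, lambda) {
  H <- make_hal_basis(X)
  glmnet(H, y, family = "binomial", lambda = lambda, standardize = FALSE)
}
```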

Note that Q W , n is just an unrestricted MLE and thus equals the empirical cumulative distribution function. Therefore, we actually have that | | Q W , n Q W , 0 | | = O P ( n 1 / 2 ) in supremum norm, while d 02 ( Q 2 n , Q 20 ) and d 02 ( G n , G 0 ) = O P ( n 1 / 2 α / 4 ) where d is the dimension of O. If m 2 < d 2 , then one should be able to improve the bound into n 1 / 2 α ( m 2 ) .

CV-HAL-MLEs: The above HAL-MLEs are determined by ( C Q u = ( 1 , C Q 2 u ) , C G u ) and could thus be denoted with Q n , C Q u = Q ^ C Q u ( P n ) and G n , C G u = G ^ C G u ( P n ) . Let C Q 0 = Q 0 v * = ( 1 , Q 20 v * ) and C G 0 = Q 0 v * , respectively, which are thus smaller than C Q u and C G u , respectively. We can now define the cross-validation selector that selects the best HAL-MLE over all C Q and C G smaller than these upper-bounds:

C Q n = arg min C Q 1 = 1 , C Q 2 < C Q 2 u E B n P n , B n 1 L 1 ( Q ^ C Q ( P n , B n 0 ) ) C G n = arg min C G < C G u E B n P n , B n 1 L 2 ( G ^ C G ( P n , B n 0 ) ) ,

where B n { 0,1 } n is a random split in training sample { O i : B n ( i ) = 0 } with empirical measure P n , B n 0 and validation sample { O i : B n ( i ) = 1 } with empirical measure P n , B n 1 . This defines now the CV-HAL-MLE Q n = Q n , C Q n and G n = G n , C G n as well. Thus, by setting C Q u = C Q n and C G u = C G n , our HAL-MLEs equal the CV-HAL-MLE.

HAL-TMLE: Let logit Q̄ n , ϵ = logit Q̄ n + ϵ C ( G n ) , or, equivalently, Q 2 n , ϵ = Q 2 n + ϵ C ( G n ) , where C ( G n ) ( A , W ) = ( 2 A − 1 ) / g n ( A | W ) . Let ϵ n = arg min ϵ P n L 1 ( Q n , ϵ ) . This defines the TMLE Q̄ n * = Q̄ n , ϵ n of Q̄ 0 , and thereby Q 2 n * = Q 2 n , ϵ n . We can also define a local least favorable submodel { Q W , n , ϵ 2 : ϵ 2 } for Q W , n , but since Q W , n is an NPMLE one will have that ϵ 2 n = arg min ϵ 2 P n L 11 ( Q W , n , ϵ 2 ) = 0 , and thereby that the TMLE of Q 0 for any such 2-dimensional least favorable submodel is given by Q n * = ( Q W , n , Q 2 n * ) . It follows that P n D * ( Q n * , G n ) = 0 .
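A minimal R sketch of this targeting step for binary Y, using an offset logistic regression to fit ϵ n (names are illustrative):

```r
tmle_update_ate <- function(A, Y, Qbar_AW, Qbar1, Qbar0, g1) {
  # clever covariate C(G_n)(A, W) = (2A - 1) / g_n(A | W)
  H <- (2 * A - 1) / ifelse(A == 1, g1, 1 - g1)
  # epsilon_n: MLE along the least favorable submodel, via offset logistic regression
  eps <- coef(glm(Y ~ -1 + H + offset(qlogis(Qbar_AW)), family = binomial()))
  # targeted fits under both treatment levels
  Qbar1_star <- plogis(qlogis(Qbar1) + eps / g1)
  Qbar0_star <- plogis(qlogis(Qbar0) - eps / (1 - g1))
  mean(Qbar1_star - Qbar0_star)   # TMLE of the ATE, Psi(Q_n^*)
}
```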

Preservation of rate for HAL-TMLE: Lemma 2 in Appendix A shows d 01 ( Q n * , Q 0 ) converges at same rate O P ( n 1 / 2 α / 4 ) as d 01 ( Q n , Q 0 ) .

Asymptotic efficiency of HAL-TMLE and CV-HAL-TMLE: Application of Theorem 1 shows that Ψ ( Q n * ) is asymptotically efficient, where one can either choose Q n as a fixed HAL-MLE using C Q = C Q u or the CV-HAL-MLE using C Q = C Q n , and similarly, for G n . The preferred estimator would be the CV-HAL-TMLE.

Asymptotic validity of the nonparametric bootstrap for the HAL-MLEs: Firstly, note that the bootstrapped HAL-MLEs

Q 2 n # = arg min Q 2 v * < C Q 2 u , Q 2 Q 2 n P n # L 12 ( Q 2 ) ,

and

G n # = arg min Q v * < C G u , G G n P n # L 2 ( G )

are easily computed as a standard LASSO regression using L 1 -penalties corresponding to C Q 2 u and C G u and including only the (at most n) indicator basis functions with non-zero coefficients selected by Q n and G n , respectively. This makes the actual computation of the nonparametric bootstrap distribution a very doable computational problem, even though the single computation of Q n and G n is highly demanding for a large dimension of W and sample size n. Q 1 n # = Q W , n # is simply the empirical probability measure of W 1 # , … , W n # , obtained by sampling n i.i.d. observations from Q W , n .
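A minimal R sketch of this bootstrap refit, assuming glmnet, a basis matrix H from the original fit, and a glmnet object fit at a single λ (the function name is illustrative):

```r
hal_bootstrap_refit <- function(H, y, fit, lambda, n_boot = 200) {
  # keep only the basis functions with non-zero coefficients in the original fit
  beta   <- as.matrix(coef(fit))[-1, 1]
  active <- which(beta != 0)
  replicate(n_boot, {
    idx <- sample(nrow(H), replace = TRUE)       # nonparametric bootstrap: resample rows
    glmnet::glmnet(H[idx, active, drop = FALSE], y[idx],
                   family = "binomial", lambda = lambda, standardize = FALSE)
  }, simplify = FALSE)
}
```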

Behavior of HAL-MLE under sampling from P n : By Theorem 3 we have that each of the terms d n 12 ( Q 2 n # , Q 2 n ) ; d n 2 ( G n # , G n ) ; d 01 ( Q 2 n # , Q 20 ) and d 02 ( G n # , G 0 ) are O P ( n 1 / 2 α / 4 ) .

Preservation of rate of TMLE under sampling from P n : Lemma 4 in Appendix C proves that indeed d 01 ( Q n # * , Q 0 ) converges at same rate O P ( n 1 / 2 α / 4 ) as d 01 ( Q n , Q 0 ) .

Consistency of nonparametric bootstrap for HAL-TMLE: This verifies all conditions of Theorem 4 which establishes the asymptotic efficiency and asymptotic consistency of the nonparametric bootstrap.

Theorem 5

Consider statistical model M defined by (20), indexed by sectional variation norm bounds C u . Let the statistical target parameter Ψ : M IR be defined by Ψ ( P ) = Ψ 1 ( P ) Ψ 0 ( P ) , where Ψ a ( P ) = E P E P ( Y | A = a , W ) . Consider the HAL-TMLE Q n * of Q 0 defined above. We have that Ψ ( Q n * ) is asymptotically efficient, i.e., n 1 / 2 ( Ψ ( Q n * ) Ψ ( Q 0 ) ) d N ( 0 , σ 0 2 ) , where σ 0 2 = P 0 { D * ( P 0 ) } 2 . In addition, conditional on ( P n : n 1 ) , Z n 1 , # = n 1 / 2 ( Ψ ( Q n # * ) Ψ ( Q n * ) ) d N ( 0 , σ 0 2 ) . This can also be applied to the setting in which C u is replaced by the cross-validation selector C n u defined above, or any other data adaptive selector C ˜ n u satisfying P ( C n u C ˜ n u C u ) = 1 .

5.2 Nonparametric estimation of integral of square of density

Statistical model, target parameter, canonical gradient: Let O ∈ IR d be a multivariate random variable with probability distribution P 0 with support [ 0 , τ ] . Let M be a nonparametric model dominated by Lebesgue measure μ, where we assume that for each P ∈ M its density p = d P / d μ is bounded from above by some M < ∞ and from below by some δ > 0 . Consider the parametrization p ( x ) = p ( Q ) ( x ) = c ( Q ) / ( 1 + exp ( − Q ( x ) ) ) , where c ( Q ) is the normalizing constant. Our model M also assumes that Q varies over all cadlag functions with sectional variation norm bounded by C u < ∞ . Due to the C u -bound, we also have that any density p in our model is bounded from above by an M = M ( C u ) and from below by a δ = δ ( C u ) . This shows that our model M = { P : Q ( P ) ∈ D [ 0 , τ ] , | | Q | | v * < C u } is of the type (6).

An alternative formulation that avoids a normalizing constant c ( Q ) is the following. For sake of presentation, let’s consider the case that d = 2 . We factorize p ( x ) = p 1 ( x 1 ) p 2 ( x 2 | x 1 ) . Subsequently, we parametrize p 1 in terms of its hazard λ 1 ( x 1 ) = p 1 ( x 1 ) / x 1 τ 1 p 1 ( u ) d u , and p 2 in terms of its conditional hazard λ 2 ( x 2 | x 1 ) = p 2 ( x 2 | x 1 ) / x 2 τ 2 p 2 ( v | x 1 ) d v . We then parametrize λ 1 = exp ( Q 1 ) and λ 2 = exp ( Q 2 ) so that the functions Q 1 ( x 1 ) and Q 2 ( x 2 | x 1 ) are unrestricted. This then defines a parameterization p = p Q 1 , Q 2 . The log-likelihood provides valid loss functions for Q 1 and Q 2 , so that the HAL-MLE can be computed by maximizing the log-likelihood over linear combinations of a large number of indicator basis functions with a bound on the L 1 -norm of its coefficients.

The target parameter Ψ : M IR is defined as Ψ ( P ) = E P p ( O ) = p 2 ( o ) d μ ( o ) . We can represent Ψ ( P ) as a function of Q or the density p, so that we will also denote it with Ψ ( Q ) or Ψ ( p ) . This target parameter is pathwise differentiable at P with canonical gradient

D * ( Q ) ( O ) = 2 ( p ( O ) Ψ ( p ) ) ,

where p = p ( Q ) .

Exact second order remainder: It implies the following exact second-order expansion:

Ψ ( Q ) Ψ ( Q 0 ) = ( P P 0 ) D * ( Q ) + R 20 ( Q , Q 0 ) ,

where

R 20 ( Q , Q 0 ) ( p p 0 ) 2 d μ .

Loss function: As loss function for Q we could consider the log-likelihood loss L ( Q ) ( O ) = log p ( O ) with d 0 ( Q , Q 0 ) = P 0 log p 0 / p , where again p 0 = p ( Q 0 ) and p = p ( Q ) . We have | | p 1 / 2 p 0 1 / 2 | | P 0 2 P 0 log p 0 / p so that

| R 20 ( Q , Q 0 ) | = ( p p 0 ) 2 d μ = sup ( p 1 / 2 + p 0 1 / 2 ) 2 p 0 ( p 1 / 2 p 0 1 / 2 ) 2 d P 0 M / δ P 0 log p 0 / p = M / δ d 0 ( Q , Q 0 ) .

Alternatively, we could consider the loss function

L ( Q ) ( O ) = 2 p ( O ) + p 2 d μ .

Note that this is indeed a valid loss function with loss-based dissimilarity given by

d 0 ( Q , Q 0 ) = P 0 L ( Q ) P 0 L ( Q 0 ) = 2 p ( o ) p 0 ( o ) d μ ( o ) + p 2 d μ + 2 p 0 2 d μ p 0 2 d μ = ( p p 0 ) 2 d μ .

Bounding second order remainder: Thus, if we select this loss function, then we have

| R 20 ( Q , Q 0 ) | = d 0 ( p , p 0 ) .

In terms of our general notation, we now have f ( x ) = x 2 for the upper bound on R 20 so that | R 20 ( Q , Q 0 ) | = f ( d 0 1 / 2 ( Q , Q 0 ) ) . The canonical gradient is indeed continuous in Q as stated in (9) and the bounds M 1 , M 2 , M 3 (4) are obviously finite and can be expressed in terms of ( C u , M ( C u ) , δ ( C u ) ) . This verifies the assumptions on our model as stated in Section 2.

HAL-MLE and CV-HAL-MLE: Let Q n = arg min Q , | | Q | | v * < C u P n L ( Q ) be the HAL-MLE, where Q can be represented by our general representation (5), Q ( o ) = Q ( 0 ) + ∑ s ⊂ { 1 , … , d } ∫ ( 0 s , o s ] d Q s ( u s ) , and constrained to satisfy

| Q ( 0 ) | + ∑ s ⊂ { 1 , … , d } ∫ ( 0 s , τ s ] | d Q s ( u s ) | ≤ C u .

Let’s denote this Q n with Q n , C u . Thus, for a given C, computation of Q n , C can be done with a LASSO type algorithm. Let C n = arg min C E B n P n , B n 1 L ( Q ^ C ( P n , B n 0 ) ) be the cross-validation selector of C, as defined in previous example. If we set C = C n , then we obtain the CV-HAL-MLE Q n = Q n , C n . By our general result on HAL-MLE for bounded loss functions, we have (for both the log-likelihood loss and L 2 -loss functions) d 0 ( Q n , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

TMLE using Local least favorable submodel and log-likelihood loss: Let p n = p ( Q n ) . A possible local least favorable submodel through p n when using the log-likelihood loss is given by p n , ϵ l f m = ( 1 + ϵ D * ( p n ) ) p n for ϵ in a small enough neighborhood so that p n , ϵ > 0 everywhere: for example, | ϵ | < 1 / p n . Let ϵ n = arg min ϵ P n L ( p n , ϵ l f m ) . If ϵ n is not in the interior, then one would set p n 1 = p n , ϵ n and iterate this updating process till ϵ n falls in interior. The final update is denoted with p n * . As sample size increases, with probability tending to one ϵ n will already be in the interior, so that p n * would be a closed form one-step TMLE. Due to the o P ( n 1 / 4 ) -rate of convergence of the HAL-MLE p n , it follows that P n D * ( p n , ϵ l f m ) = o P ( n 1 / 2 ) .

TMLE using Universal least favorable submodel and log-likelihood loss: One can also define a universal least favorable submodel [16] by recursively applying the above local least favorable submodel:

p n , ϵ + d ϵ = p n , ϵ , d ϵ l f m ,

where p n , ϵ , d ϵ l f m is the local least favorable submodel through p n , ϵ at parameter value d ϵ . In this manner, the local moves of the local least favorable submodel describe a submodel satisfying d / d ϵ log p n , ϵ = D * ( p n , ϵ ) at each ϵ. Again, let ϵ n = arg min ϵ P n L ( p n , ϵ ) and p n * = p n , ϵ n , which now satisfies P n D * ( p n * ) = 0 exactly.

HAL-TMLE: The TMLE of Ψ ( Q 0 ) = Ψ ( p 0 ) is the plug-in estimator ψ n * = Ψ ( p n * ) = p n * 2 d μ . Lemma 2 in Appendix A proves d 01 ( Q n * , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

Efficiency of HAL-TMLE and CV-HAL-TMLE: Theorem 1 shows that Ψ ( p n * ) is asymptotically efficient, where one can either choose the HAL-MLE with fixed index C = C u or one can set C = C n equal to the cross-validation selector defined above (or any other data adaptive selector C ˜ n with P ( C n C ˜ n C u ) = 1 ).

Asymptotic validity of the nonparametric bootstrap for the HAL-MLE: Let C be given. As remarked in the previous example, computation of the HAL-MLE Q n # = arg  min Q v * C , Q Q n P n # L ( Q ) is much faster than the computation of Q n = arg  min | | Q | | v * C P n L ( Q ) , due to only having to minimize the empirical risk over the bootstrap sample over the linear combinations of indicator functions that had non-zero coefficients in Q n . By Theorem 3 it follows that d n ( Q n # , Q n ) and d 0 ( Q n # , Q 0 ) are O P ( n 1 / 2 α / 4 ) . For example, if we use the L 2 -loss function above, then d 0 ( Q , Q 0 ) = ( p p 0 ) 2 d μ , and, we note also that

d n ( Q n # , Q n ) = P n { L ( Q n # ) L ( Q n ) } = P n { 2 ( p n # p n ) + p n # 2 d μ p n 2 d μ } = P n { 2 ( p n # p n ) + ( p n # p n ) ( p n # + p n ) d μ } = P n { ( 2 + 2 p n ) ( p n # p n ) d μ } = ( p n # p n ) 2 d μ .

This shows that the empirical dissimilarity also equals the square of an L 2 -norm. Thus, application of Theorem 3 now shows that ( p n # p n ) 2 d μ and ( p n # p 0 ) 2 d μ are both O P ( n 1 / 2 α / 4 ) .

Preservation of rate for HAL-TMLE under sampling from P n : Lemma 4 in Appendix C establishes that d 01 ( Q n # * , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

Asymptotic consistency of the bootstrap for the HAL-TMLE: This verifies all conditions of Theorem 4 which establishes the asymptotic efficiency and asymptotic consistency of the nonparametric bootstrap.

Theorem 6

Consider the model M defined by the upper bound C u < ∞ on the sectional variation norm of Q over [ 0 , τ ] . Let Ψ ( Q ) = ∫ p ( Q ) 2 d μ , which is also denoted with Ψ ( p ) . Consider the one-step TMLE based on the local least favorable submodel or the universal least favorable submodel and the log-likelihood loss.

We have that Ψ ( p n * ) is asymptotically efficient, i.e., n 1 / 2 ( Ψ ( p n * ) Ψ ( p 0 ) ) d N ( 0 , σ 0 2 ) , where σ 0 2 = P 0 { D * ( P 0 ) } 2 .

In addition, conditional on ( P n : n 1 ) , Z n 1 , # = n 1 / 2 ( Ψ ( p n # * ) Ψ ( p n * ) ) d N ( 0 , σ 0 2 ) .

This theorem can also be applied to the setting in which C u = C n .

6 Simulation study evaluating performance of bootstrap method

6.1 Average treatment effect

To illustrate the finite sample performance of the proposed bootstrap method, we simulate a continuous outcome Y, a binary treatment A, and a continuous covariate W that confounds Y and A. The random variables are drawn from a family of distributions indexed by a 1 , which characterizes the conditional distribution of Y, given A and W (Figure 2). The distributions of the variables are as follows: W ∼ N ( 0 , 4 2 ; − 10 , 10 ) is drawn i.i.d. from a truncated normal distribution with mean 0, standard deviation 4, bounded within [ − 10 , 10 ] . A ∼ Bernoulli ( Ḡ ( W ) ) is a binary random variable with success probability Ḡ ( W ) , a function of W given by

Ḡ ( W ) = 0.3 + min ( 0.1 W sin ( 0.1 W ) + ϵ 1 , 0.4 )

where ε 1 ∼ N ( 0 , 0.05 2 ) . Y = 3 sin ( a 1 W ) + A + ε 2 , where ε 2 ∼ N ( 0,1 ) , which defines Q̄ 0 ( A , W ) = 3 sin ( a 1 W ) + A . The coefficient a 1 controls the frequency of the sinusoidal function. Increasing the frequency a 1 of the sine function increases the sectional variation norm of Q̄ 0 proportionally, so that estimating Q̄ 0 becomes more difficult under a fixed sample size. In our study, we increase a 1 while fixing the sample size and fixing the HAL-MLE G n of G 0 , so that the second order remainder increases in magnitude. The value of the parameter of interest, the ATE ψ 0 = Ψ ( P 0 ) , is 1. The experiment is repeated 1000 times. The estimation routine, including the tuning parameter search, is implemented in the ateBootstrap function in the open-source R package TMLEbootstrap [20]. A thousand replications of simulation 1 under sample size 100 with 200 bootstrap repetitions take 12 CPU hours on an Intel Core i7 4980HQ CPU.
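A minimal R sketch of this data-generating process (assuming the truncnorm package for the truncated normal draw; the function name is illustrative):

```r
simulate_ate_data <- function(n, a1) {
  W  <- truncnorm::rtruncnorm(n, a = -10, b = 10, mean = 0, sd = 4)
  gW <- 0.3 + pmin(0.1 * W * sin(0.1 * W) + rnorm(n, sd = 0.05), 0.4)   # propensity score
  A  <- rbinom(n, size = 1, prob = gW)
  Y  <- 3 * sin(a1 * W) + A + rnorm(n)                                   # Qbar_0(A, W) + noise
  data.frame(W = W, A = A, Y = Y)
}
```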

Figure 2:

(A) True conditional expectation functions of outcome E ( Y | A = 1 , W ) and E ( Y | A = 0 , W ) at a 1 = 0.5 , 1, 3, 5, 10, 15 and (B) true propensity score function.

To analyze the above simulated data, we compute the coverage and width of (i) the Wald-type confidence interval, where the nuisance functions ( Q 0 , G 0 ) are estimated using the HAL-MLE( λ C V ), and (ii) the nonparametric bootstrap confidence interval presented in Section 4, where the choice of λ in Q n , λ * is set equal to the plateau selector λ p l a t e a u . Recall that the nonparametric bootstrap of the HAL-TMLE at this choice λ p l a t e a u is used to determine the quantiles for the confidence interval around the TMLE Ψ ( Q n * ) . The Wald-type interval reflects common practice for statistical inference based on the TMLE. Results under sample sizes 500 and 1000 are shown in Figure 3.

Figure 3:

Results for ATE parameter comparing our bootstrap method and classic Wald-type method as a function of the a 1 coefficient (sectional variation norm) of the Q̄ 0 function. Panel A is the coverage of the intervals, where the dashed line indicates 95% nominal coverage. Panel B is the widths of the intervals. Within each panel, the upper plot is under sample size 500 and the lower plot is under sample size 1000.

The simulation results reflect what is expected based on theory. In particular, as the sectional variation norm of Q̄ 0 becomes large relative to sample size, the HAL regression fit of Q̄ 0 in the finite sample is not ideal, which leads to low coverage of the Wald-type interval. On the other hand, the bootstrap confidence intervals reflect the deteriorating second-order remainder in the sampling distribution of the HAL-TMLE, and, as a result, the coverage is very close to nominal and is robust to increasing sectional variation norm ( a 1 ). The results for sample size 1000 confirm our asymptotic analysis of the methods, with the Wald-type coverage improving and the two methods eventually converging to nominal coverage.

6.2 Average density value

As we demonstrated, this problem has a non-forgiving second-order remainder term that is proportional to the L 2 -norm of p n * p 0 , which makes this example very suitable for evaluating finite sample coverage of the bootstrap methods. To illustrate our proposed method and explore finite-sample performance, we simulate a family of univariate densities with increasing sectional variation norm.

f ( x ; θ K ) = 1 K k = 1 K g ( x ; μ k , σ K ) ,

where

g ( x ; μ k , σ K ) = 1 2 π σ K exp [ 1 2 ( x μ k ) 2 / σ K 2 ] .

For a given K, μ k , k = 1 , , K are equi-distantly placed in interval [ 4,4 ] . σ K = 10 / K . The true sectional variation norm of the density increases roughly linearly with K, that is f K v * = K f 1 v * , K = 1 , , 13 . Examples of the density family for K values used in the simulation are shown in Figure 4. We simulate from univariate densities for the sake of presentation and we expect our results to be informative for higher dimensional densities as well, since the main difference will be that a sectional variation norm of a multivariate function is generally larger than that of a univariate function. The value of the parameter of interest Ψ ( p 0 ) = p 0 2 d x does not change much as a function of the choice K of data distribution. For each data distribution, the experiment is repeated 1000 times. As in our ATE simulation, we compute coverages and widths of the Wald-type confidence intervals (that ignore second order remainders) and our HAL-TMLE bootstrap confidence intervals using our plateau selector λ p l a t e a u .
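A minimal R sketch of this density family and a matching sampler (function names are illustrative):

```r
mixture_density <- function(x, K) {
  mu    <- seq(-4, 4, length.out = K)   # equi-distant means on [-4, 4]
  sigma <- 10 / K
  comps <- sapply(mu, function(m) dnorm(x, mean = m, sd = sigma))
  rowMeans(matrix(comps, nrow = length(x)))
}

rmixture <- function(n, K) {
  mu    <- seq(-4, 4, length.out = K)
  sigma <- 10 / K
  rnorm(n, mean = mu[sample.int(K, n, replace = TRUE)], sd = sigma)
}

# true target value Psi(p_0) = integral of p_0^2, e.g.
# integrate(function(x) mixture_density(x, K = 5)^2, lower = -Inf, upper = Inf)$value
```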

Figure 4:

True probability density function f ( x ; θ K ) at K = 1 , 3 , 5 , 7 , 9 , 11 , 13 .

We parametrized the density in terms of its hazard, discretized the hazard to be piecewise constant across a large number of bins (as in histogram density estimation), and parametrized this piecewise constant hazard with a logistic regression for the probability of falling in bin h, given that the observation exceeded bin h − 1 . We fitted this hazard with a logistic regression based HAL-MLE using the longitudinal data format common for hazard estimation (i.e., an observation O i is coded by a number of rows with binary outcome equal to zero and a final row with outcome 1). The HAL-MLE of this hazard yields the corresponding HAL-MLE of the density itself. The HAL-TMLE updates the HAL-MLE density estimator with a TMLE update using the universal least favorable submodel and log-likelihood loss. The software implementations can be found in the cv_densityHAL function in the open-source R package TMLEbootstrap [20].
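A minimal R sketch of the long-format construction (the bin boundaries breaks are assumed to be pre-specified; the function name is illustrative):

```r
make_hazard_long_format <- function(x, breaks) {
  bins <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  do.call(rbind, lapply(seq_along(x), function(i) {
    data.frame(id    = i,
               bin   = seq_len(bins[i]),
               event = as.numeric(seq_len(bins[i]) == bins[i]))
  }))
}
# A logistic HAL fit of P(event = 1 | bin) on these rows estimates the discrete
# hazard; the implied bin probabilities divided by the bin widths give a
# piecewise-constant density estimate.
```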

The simulations reflect what is expected based on theory: the bootstrap confidence interval has superior coverage relative to the Wald-type confidence interval, uniformly across different sample sizes and data distributions (Figure 5). In particular, as the true sectional variation norm increases (with the number of modes in the density), the second-order remainder term increases so that the Wald-type interval coverage declines. On the other hand, the bootstrap confidence intervals reflect the behavior of the second order remainder and thereby increase in width as the performance of the HAL-MLE deteriorates (due to the increased complexity of the true density). The bootstrap confidence interval keeps the coverage close to the nominal rate, and its coverage is not very sensitive to the true sectional variation norm of the density function. When the sample size increases to 1000, the Wald-type interval coverage increases, and in simple cases where the true sectional variation norm is small, the Wald-type interval reaches its desired nominal coverage.

Figure 5:

Results for average density value parameter comparing our bootstrap method and classic Wald-type method as a function of the number of modes in the true density (sectional variation norm). Panel A is the coverage of the intervals, where the dashed line indicates 95% nominal coverage. Panel B is the widths of the intervals. Within each panel, the upper plot is under sample size 500 and the lower plot is under sample size 1000.

7 Discussion

On one hand, in parametric models and, more generally, in models small enough so that the MLE is still well behaved, one can use the nonparametric bootstrap to estimate the sampling distribution of the MLE. It is generally understood that in these small models the nonparametric bootstrap outperforms estimating the sampling distribution with a normal distribution (e.g., with variance estimated as the sample variance of the influence curve of the MLE), by picking up the higher order behavior of the MLE, if asymptotics has not set in yet. In such small models, reasonable sample sizes already achieve the normal approximation in which case the Wald type confidence intervals will perform well. Generally speaking, the nonparametric bootstrap is a valid method when the estimator is a compactly differentiable function of the empirical measure, such as the Kaplan–Meier estimator (i.e., one can apply the functional delta-method to analyze such estimators) [21] (Theorem 3.9.11 in [15]). These are estimators that essentially do not use smoothing of any sort.

On the other hand, efficient estimation of a pathwise differentiable target parameter in large realistic models generally requires estimation of the data density, and thereby machine learning such as super-learning to estimate the relevant parts of the data distribution. Therefore, efficient one-step estimators or TMLEs are not compactly differentiable functions of the data distribution. Due to this reason, we moved away from using the nonparametric bootstrap to estimate its sampling distribution, since it represents a generally inconsistent method (e.g., a cross-validation selector behaves very differently under sampling from the empirical distribution than under sampling from the true data distribution) [22]. Instead we estimated the normal limit distribution by estimating the variance of the influence curve of the estimator.

Such an influence curve based method is asymptotically consistent and therefore results in asymptotically valid 0.95-level confidence intervals. However, in such large models the nuisance parameter estimators will converge at slow rates (like n 1 / 4 or slower) with large constants depending on the size of the model, so that for normal sample sizes the exact second-order remainder could easily be larger than the leading empirical process term with its normal limit distribution. So one has to pay a significant price for using the computationally attractive influence curve based confidence intervals, where inference ignores the remainder terms ( P n P 0 ) ( D n * D 0 * ) and R 2 ( P n * , P 0 ) . In finite samples these remainder terms can have non-zero expectation or a large variance, so that the influence curve-based inference using a normal limit distribution can be off-centered or less spread out than the actual sampling distribution of the estimator.

One might argue that one should use a model based bootstrap instead by sampling from an estimator of the density of the data distribution. General results show that such a model based bootstrap method will be asymptotically valid as long as the density estimator is consistent [23], [24], [25]. This is like carrying out a simulation study for the estimator in question using an estimator of the true data distribution as sampling distribution. However, estimation of the actual density of the data distribution is itself a very hard problem, with bias heavily affected by the curse of dimensionality, and, in addition, it can be immensely burdensome to construct such a density estimator and sample from it when the data is complex and high dimensional.

As demonstrated in this article, the HAL-MLE provides a solution to this bottleneck. The HAL-MLE( C u ) of the nuisance parameter is an actual MLE minimizing the empirical risk over an infinite dimensional parameter space (depending on the model M ) in which it is assumed that the sectional variation norm of the nuisance parameter is bounded by the universal constant C u . This MLE is still well behaved by being consistent at a rate that is in the worst case still faster than n 1 / 4 . However, this MLE is not an interior MLE, but will be on the edge of its parameter space: the MLE will itself have sectional variation norm equal to the maximal allowed value C u . Nonetheless, our analysis shows that it is still a smooth enough function of the data (while not being compactly differentiable at all) that it is equally well behaved under sampling from the empirical distribution.

As a consequence of this robust behavior of the HAL-MLE, for models in which the nuisance parameters of interest are cadlag functions with a universally bounded sectional variation norm (beyond possible other assumptions), we presented asymptotically consistent estimators of the sampling distribution of the HAL-TMLE of the target parameter of interest using the nonparametric bootstrap.

Our estimators of the sampling distribution are highly sensitive to the curse of dimensionality, just as the sampling distribution of the HAL-TMLE itself: specifically, the HAL-MLE on a bootstrap sample will converge just as slowly to its truth as under sampling from the true distribution. Therefore, in high dimensional estimation problems, we expect highly significant gains in valid inference relative to Wald type confidence intervals that are purely based on the normal limit distribution of the HAL-TMLE.

In general, the user will typically not know how to select the upper bound C u on the sectional variation norm of the nuisance parameters (except if the nuisance parameters are cumulative distribution functions). Therefore, for the sake of estimation of Q 0 and G 0 we recommend to select this bound with cross-validation. Due to the oracle inequality for the cross-validation selector C n (which only relies on a bound on the supremum norm of the loss function), the data adaptively selected upper bound will be selected larger than (but close to) the true sectional variation norm C 0 of the nuisance parameters ( Q 0 , G 0 ) , as sample size increases.

Even though, for this cross-validation selector C n , our bootstrap estimators will still be guaranteed to be consistent for its normal limit distribution, this choice C n will be trading off bias and variance for the sake of estimation of the nuisance parameter. As a consequence, in practice this C n u might often end up selecting a value significantly smaller than the true sectional variation norms of Q 0 and G 0 . This is comparable with selecting models for Q 0 and G 0 to be used in the bootstrap that are potentially much smaller than the model that would be needed to capture Q 0 and G 0 . That is, our proposed bootstrap would then still not capture the full complexity of the estimation problem and still result in anti-conservative confidence intervals. Therefore we proposed a finite sample modification for the nonparametric bootstrap of the HAL-TMLE by using the bootstrap distribution for the HAL-TMLE at a fixed sectional variation norm determined by a plateau selector (instead of the cross-validation selector of C u ). Our proposed finite sample modification also uses a scaling σ n 2 that incorporates both the bias and the variance of the bootstrap distribution. Our simulations demonstrate the importance of this finite sample modification and showcase excellent finite sample coverage. Any improvements for variance estimation relative to using the empirical variance of the influence curve can be incorporated naturally in this method, such as the plug-in variance estimator that is robust under data sparsity presented in [26].

There are a number of important future directions to this research. One direction is to derive finite-sample bounds on our bootstrap interval coverage probability, which will give additional guarantees for applications.


Corresponding author: Mark van der Laan, Division of Biostatistics, University of California, Berkeley, USA, E-mail:

Award Identifier / Grant number: 5R01AI074345-07

Acknowledgment

We thank the reviewers for suggesting the enlargement of the class F C into a class H C shifted by an unrestricted constant, and for the proof of the equivalence of the two classes in metric entropy.

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This research is funded by NIH-grant 5R01AI074345-07.

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

References

1. Bickel, PJ, Klaassen, CAJ, Ritov, Y, Wellner, J. Efficient and adaptive estimation for semiparametric models. Berlin Heidelberg New York: Springer; 1997.

2. Gill, RD, van der Laan, MJ, Wellner, JA. Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré 1995;31:545–97.

3. van der Laan, MJ, Rubin, DB. Targeted maximum likelihood learning. Int J Biostat 2006;2:Article 11. https://doi.org/10.2202/1557-4679.1043.

4. van der Laan, MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat 2008;4:Article 17. https://doi.org/10.2202/1557-4679.1114.

5. van der Laan, MJ, Rose, S. Targeted learning: causal inference for observational and experimental data. Berlin Heidelberg New York: Springer; 2011. https://doi.org/10.1007/978-1-4419-9782-1.

6. van der Laan, MJ, Rose, S. Targeted learning in data science: causal inference for complex longitudinal studies. Berlin Heidelberg New York: Springer; 2017. https://doi.org/10.1007/978-3-319-65304-4.

7. van der Laan, MJ. A generally efficient targeted minimum loss-based estimator. UC Berkeley; 2015. Technical Report 300. http://biostats.bepress.com/ucbbiostat/paper343.

8. Benkeser, D, van der Laan, MJ. The highly adaptive lasso estimator. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016:689–96. https://doi.org/10.1109/DSAA.2016.93.

9. van der Laan, MJ, Dudoit, S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Berkeley: Division of Biostatistics, University of California; 2003. Technical Report 130.

10. van der Vaart, AW, Dudoit, S, van der Laan, MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis 2006;24:351–71. https://doi.org/10.1524/stnd.2006.24.3.351.

11. van der Laan, MJ, Dudoit, S, van der Vaart, AW. The cross-validated adaptive epsilon-net estimator. Stat Decis 2006;24:373–95. https://doi.org/10.1524/stnd.2006.24.3.373.

12. van der Laan, MJ, Polley, EC, Hubbard, AE. Super learner. Stat Appl Genet Mol 2007;6:Article 25. https://doi.org/10.2202/1544-6115.1309.

13. Polley, EC, Rose, S, van der Laan, MJ. Super learner. In: van der Laan, MJ, Rose, S, editors. Targeted learning: causal inference for observational and experimental data. New York Dordrecht Heidelberg London: Springer; 2011. https://doi.org/10.1007/978-1-4419-9782-1_3.

14. Neuhaus, G. On weak convergence of stochastic processes with multidimensional time parameter. Ann Math Stat 1971;42:1285–95. https://doi.org/10.1214/aoms/1177693241.

15. van der Vaart, AW, Wellner, JA. Weak convergence and empirical processes. Berlin Heidelberg New York: Springer; 1996. https://doi.org/10.1007/978-1-4757-2545-2.

16. van der Laan, MJ, Gruber, S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat 2016;12:351–78. https://doi.org/10.1515/ijb-2015-0054.

17. van der Laan, MJ. A generally efficient targeted minimum loss based estimator. Int J Biostat 2017;13:20150097. https://doi.org/10.1515/ijb-2015-0097.

18. Bibaut, A, van der Laan, MJ. Fast rates for empirical risk minimization over cadlag functions with bounded sectional variation norm. Berkeley: Division of Biostatistics, University of California; 2019. Technical report.

19. Davies, M, van der Laan, MJ. Sieve plateau variance estimators: a new approach to confidence interval estimation for dependent data. Berkeley: Division of Biostatistics, University of California; 2014. Working Paper Series, Technical report. http://biostats.bepress.com/ucbbiostat/paper322/.

20. Cai, W, van der Laan, M. TMLEbootstrap: HAL-TMLE bootstrap in R; 2018. https://github.com/wilsoncai1992/TMLEbootstrap.

21. Gill, RD. Non- and semiparametric maximum likelihood estimators and the von Mises method (part 1). Scand J Stat 1989;16:97–128.

22. Coyle, J, van der Laan, MJ. Targeted bootstrap. In: Targeted learning in data science. Springer International Publishing; 2018:523–39. https://doi.org/10.1007/978-3-319-65304-4_28.

23. Arcones, MA, Giné, E. The bootstrap of the mean with arbitrary bootstrap sample size. Annales de l’IHP Probabilités et statistiques 1989;25:457–81.

24. Giné, E, Zinn, J. Necessary conditions for the bootstrap of the mean. Ann Stat 1989;17:684–91. https://doi.org/10.1214/aos/1176347134.

25. Arcones, MA, Giné, E. On the bootstrap of M-estimators and other statistical functionals. In: Exploring the limits of bootstrap. New York: Wiley; 1992:13–47.

26. Tran, L, Petersen, M, Schwab, J, van der Laan, MJ. Robust variance estimation and inference for causal effect estimation. Berkeley: Division of Biostatistics, University of California; 2018. Technical report. arXiv:1810.03030.

27. van der Vaart, AW, Wellner, JA. A local maximal inequality under uniform entropy. Electron J Stat 2011;5:192–203. https://doi.org/10.1214/11-EJS605.

Appendix

The HAL-MLEs on the original sample and bootstrap sample will be defined below as Q n = arg min Q Q ( M ) P n L 1 ( Q ) and Q n # = arg min Q Q ( M ) P n # L 1 ( Q ) , and, if we assume the extra structure (6) so that we know that Q ( M ) is itself defined as a space of cadlag functions with bounds on its sectional variation norm, then we let Q n # = arg min Q Q ( M ) , Q Q n , Q v * | | Q n | | v * P n # L 1 ( Q ) . In general, one can add restrictions to the parameter space over which one minimizes in the definition of Q n and Q n # as long as one guarantees that, with probability tending to 1, P n L 1 ( Q n ) P n L 1 ( Q 0 ) , and, with probability tending to 1, conditional on P n , P n # L 1 ( Q n # ) P n # L 1 ( Q n ) . For example, this allows one to use an upper bound C n u on the sectional variation norm in the definition of Q n if we know that C n u will be larger than the true C 0 u = | | Q 0 | | v * with probability tending to 1.

A Proof that the one-step TMLE Q n * preserves rate of convergence of Q n

The following lemma establishes that the one-step TMLE Q n * = Q n , ϵ n preserves the rate of convergence of Q n , where Q ϵ is a univariate local least favorable submodel through Q at ϵ = 0 . We use the notation L 1 ( Q 1 , Q 2 ) = L 1 ( Q 1 ) L 1 ( Q 2 ) .

Lemma 2

Let Q n = arg min Q Q ( M ) P n L 1 ( Q ) be the HAL-MLE of Q 0 , and let ϵ n = arg min ϵ P n L 1 ( Q n , ϵ ) for some parametric (e.g., local least favorable) submodel { Q ϵ : ϵ } M . Assume the bounds (2) , ( 4 ) on loss function L 1 ( Q ) , so that we also know d 01 ( Q n , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

Then,

(22) d 01 ( Q n * , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

Specifically, we have

d 01 ( Q n * , Q 0 ) ( P n P 0 ) L 1 ( Q n , ϵ n , Q n ) + d 01 ( Q n , Q 0 ) .

This also proves that the K-th step TMLE using a finite K (uniform in n) number of iterations satisfies d 01 ( Q n * , Q 0 ) d 01 ( Q n , Q 0 ) + O P ( n 1 / 2 α / 4 ) . So if r n = P n D * ( Q n , ϵ n , G n ) is not yet o P ( n 1 / 2 ) , then one should consider a K-th step TMLE to guarantee that r n is small enough to be neglected (we know that the fully iterated TMLE will solve P n D * ( Q n * , G n ) = 0 , but this one is harder to analyze).

Proof of Lemma 2: We have

P 0 L 1 ( Q n * ) P 0 L 1 ( Q 0 ) = P 0 L 1 ( Q n , ϵ n , Q n ) + P 0 L 1 ( Q n , Q 0 ) = ( P 0 P n ) L 1 ( Q n , ϵ n , Q n ) + P n L 1 ( Q n , ϵ n , Q n ) + d 01 ( Q n , Q 0 ) ( P n P 0 ) L 1 ( Q n , ϵ n , Q n ) + d 01 ( Q n , Q 0 )

Since L 1 ( Q n , ϵ n , Q n ) falls in class of cadlag functions with a universal bound on sectional variation norm (i.e., a Donsker class), and d 01 ( Q n , Q 0 ) = O P ( n 1 / 2 α / 4 ) , it follows that d 01 ( Q n * , Q 0 ) = O P ( n 1 / 2 ) + O P ( n 1 / 2 α / 4 ) . Now, we use that the L 2 ( P 0 ) -norm of L 1 ( Q n , ϵ n , Q n ) is bounded by sum of L 2 ( P 0 ) -norm of L 1 ( Q n * ) L 1 ( Q 0 ) and L 1 ( Q n ) L 1 ( Q 0 ) . These latter L 2 ( P 0 ) -norms can be bounded by d 01 1 / 2 ( Q n * , Q 0 ) and d 01 1 / 2 ( Q n , Q 0 ) , which thus converges at rate O P ( n 1 / 4 α / 2 ) . Again, by empirical process theory, using that we now know P 0 { L 1 ( Q n * , Q n ) } 2 = o P ( n 1 / 4 α / 2 ) , it follows immediately that d 01 ( Q n * , Q 0 ) = o P ( n 1 / 2 ) , but, by using the actual rate for this L 2 ( P 0 ) -norm, as in the Appendix in [7] it follows that d 01 ( Q n * , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

B Asymptotic convergence of bootstrapped HAL-MLE: proof of theorem 3.

Theorem 7 below shows that both d n 1 ( Q n # , Q n ) = P n { L 1 ( Q n # ) L 1 ( Q n ) } and d 01 ( Q n # , Q 0 ) converge at rate n 1 / 2 α / 4 . The analog results apply to G n # .

Theorem 7

Consider a statistical model M satisfying (4), (2) on L 1 ( Q ) . Let Q n = arg min Q Q ( M ) P n L 1 ( Q ) and Q n # = arg min Q Q ( M ) P n # L 1 ( Q ) . In a model with extra structure (6) we define

Q n # = arg min Q Q ( M ) , Q Q n , | | Q n # | | v * | | Q n | | v * P n # L 1 ( Q ) .

Then,

d n 1 ( Q n # , Q n ) = O P ( n 1 / 2 α / 4 )   a n d   d 01 ( Q n # , Q 0 ) = O P ( n 1 / 2 α / 4 ) .

Proof of Theorem 7: We have

(23) 0 d n 1 ( Q n # , Q n ) P n { L 1 ( Q n # ) L 1 ( Q n ) } = ( P n # P n ) { L 1 ( Q n # ) L 1 ( Q n ) } + P n # { L 1 ( Q n # ) L 1 ( Q n ) } ( P n # P n ) { L 1 ( Q n # ) L 1 ( Q n ) } .

As a consequence, by empirical process theory [27], we have d n 1 ( Q n # , Q n ) = P n L 1 ( Q n # ) P n L 1 ( Q n ) is O P ( n 1 / 2 ) . We now note that

(24) d 01 ( Q n # , Q 0 ) = P 0 L 1 ( Q n # , Q n ) + P 0 L 1 ( Q n , Q 0 ) = ( P 0 P n ) L 1 ( Q n # , Q n ) + P n L 1 ( Q n # , Q n ) + d 01 ( Q n , Q 0 ) = ( P 0 P n ) L 1 ( Q n # , Q n ) + d n 1 ( Q n # , Q n ) + d 01 ( Q n , Q 0 ) .

Thus, it also follows that d 01 ( Q n # , Q 0 ) = O P ( n 1 / 2 ) . By assumption 2 this implies P 0 { L 1 ( Q n # ) L 1 ( Q 0 ) } 2 = O P ( n 1 / 2 ) . Note now that L 1 ( Q n # ) L 1 ( Q n ) = L 1 ( Q n # ) L 1 ( Q 0 ) + L 1 ( Q 0 ) L 1 ( Q n ) , using that P 0 { L 1 ( Q n ) L 1 ( Q 0 ) } 2 = O P ( n 1 / 2 ) , it follows that also P 0 { L 1 ( Q n # ) L 1 ( Q n ) } 2 = O P ( n 1 / 2 ) . By Lemma 3, it follows that also P n { L 1 ( Q n # ) L 1 ( Q n ) } 2 = O P ( n 1 / 2 ) . With this result in hand, using [27] as in Appendix in [7], it follows that ( P n # P n ) { L 1 ( Q n # ) L 1 ( Q n ) } = O P ( n 1 / 2 α / 4 ) . This proves that d n 1 ( Q n # , Q n ) = O P ( n 1 / 2 α / 4 ) . Using the same relation (24), this implies d 01 ( Q n # , Q n ) = O P ( n 1 / 2 α / 4 ) .

Lemma 3

Suppose that f n 2 d P n = O P ( n 1 / 2 α / 4 ) and we know that | | f n | | v * < M for some M < . Then f n 2 d P 0 = O P ( n 1 / 2 α / 4 ) .

Proof: We have

f n 2 d P 0 = f n 2 d ( P n P 0 ) + f n 2 d P n = f n 2 d ( P n P 0 ) + O P ( n 1 / 2 α / 4 ) .

We have f n 2 d ( P n P 0 ) = O P ( n 1 / 2 ) . This proves that f n 2 d P 0 = O P ( n 1 / 2 ) . By asymptotic equicontinuity of the empirical process indexed by cadlag functions with uniformly bounded sectional variation norm, it follows now also that f n 2 d ( P n P 0 ) = O P ( n 1 / 2 α / 4 ) (the same proof can be found in Theorem 1 of van der Laan [7] using Lemma 10 in the same paper). Thus, this proves that indeed that f n 2 d P 0 = O P ( n 1 / 2 α / 4 ) follows from f n 2 d P n = O P ( n 1 / 2 α / 4 ) .

C Proof that the one-step TMLE Q n # * preserves the rate of convergence of Q n #

The following lemma establishes that the one-step TMLE Q n # * = Q n , ϵ n # # preserves the rate of convergence d 01 ( Q n # , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) of Q n # established in Theorem 3, in the sense that also d 01 ( Q n # * , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) . Recall the notation L 1 ( Q 1 , Q 2 ) = L 1 ( Q 1 ) − L 1 ( Q 2 ) .

Lemma 4

Let Q n = arg min Q ∈ Q ( M ) P n L 1 ( Q ) , and let ϵ n = arg min ϵ P n L 1 ( Q n , ϵ ) for a parametric submodel { Q n , ϵ : ϵ } ⊂ M through Q n at ϵ = 0 . Assume (4), (2) so that we know d 01 ( Q n , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) . By Lemma 2 we also have d 01 ( Q n * , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) . Let Q n # = arg min Q ∈ Q ( M ) P n # L 1 ( Q ) be the HAL-MLE on the bootstrap sample. By Theorem 7 we also have d n 1 ( Q n # , Q n ) = O P ( n − 1 / 2 − α / 4 ) , where d n 1 ( Q , Q n ) = P n L 1 ( Q ) − P n L 1 ( Q n ) , and d 01 ( Q n # , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) . Let ϵ n # = arg min ϵ P n # L 1 ( Q n , ϵ # ) , and Q n # * = Q n , ϵ n # # .

Then,

(25) d 01 ( Q n # * , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) .

Proof of Lemma 4: Firstly, we note that

d 01 ( Q n # * , Q 0 ) = P 0 L 1 ( Q n # * , Q n * ) + P 0 L 1 ( Q n * , Q 0 ) = ( P 0 − P n ) L 1 ( Q n # * , Q n * ) + P n L 1 ( Q n # * , Q n * ) + d 01 ( Q n * , Q 0 ) = ( P 0 − P n ) L 1 ( Q n # * , Q n * ) + d n 1 ( Q n # * , Q n * ) + d 01 ( Q n * , Q 0 ) = ( P 0 − P n ) L 1 ( Q n # * , Q n * ) + d n 1 ( Q n # * , Q n * ) + O P ( n − 1 / 2 − α / 4 ) .

Using that d n 1 ( Q n # , Q n ) , d 01 ( Q n , Q 0 ) , d 01 ( Q n * , Q 0 ) , and (thereby also, by [7]) ( P n − P 0 ) L 1 ( Q n , Q n * ) are all four O P ( n − 1 / 2 − α / 4 ) , we obtain

d n 1 ( Q n # * , Q n * ) = P n L 1 ( Q n , ϵ n # # ) − P n L 1 ( Q n , ϵ n ) = P n L 1 ( Q n , ϵ n # # , Q n # ) + P n L 1 ( Q n # , Q n ) + P n L 1 ( Q n , Q n , ϵ n ) = ( P n − P n # ) L 1 ( Q n , ϵ n # # , Q n # ) + P n # L 1 ( Q n , ϵ n # # , Q n # ) + d n 1 ( Q n # , Q n ) + ( P n − P 0 ) L 1 ( Q n , Q n * ) + P 0 L 1 ( Q n , Q 0 ) − P 0 L 1 ( Q n * , Q 0 ) ≤ ( P n − P n # ) L 1 ( Q n , ϵ n # # , Q n # ) + d n 1 ( Q n # , Q n ) + ( P n − P 0 ) L 1 ( Q n , Q n * ) + d 01 ( Q n , Q 0 ) + d 01 ( Q n * , Q 0 ) = ( P n − P n # ) L 1 ( Q n , ϵ n # # , Q n # ) + O P ( n − 1 / 2 − α / 4 ) .

Plugging this bound for d n 1 ( Q n # * , Q n * ) in our expression above for d 01 ( Q n # * , Q 0 ) yields:

d 01 ( Q n # * , Q 0 ) ≤ ( P 0 − P n ) L 1 ( Q n # * , Q n * ) + ( P n − P n # ) L 1 ( Q n # * , Q n # ) + O P ( n − 1 / 2 − α / 4 ) .

By assumption, we have that L 1 ( Q n # * ) , L 1 ( Q n # ) , L 1 ( Q n * ) are elements of the class of cadlag functions with a universal bound on the sectional variation norm, which is a uniform Donsker class. By empirical process theory for the empirical process ( ( P n − P 0 ) f : f ) and, conditional on P n , for ( ( P n # − P n ) f : f ) indexed by this Donsker class, it follows that d 01 ( Q n # * , Q 0 ) = O P ( n − 1 / 2 ) + O P ( n − 1 / 2 − α / 4 ) . With this result in hand, we now revisit the two empirical process terms in the above bound for d 01 ( Q n # * , Q 0 ) so that the O P ( n − 1 / 2 ) improves to O P ( n − 1 / 2 − α / 4 ) . First, consider the second term. The L 2 ( P n ) -norm of L 1 ( Q n # * , Q n # ) is bounded by the sum of the L 2 ( P n ) -norms of L 1 ( Q n # * , Q 0 ) and L 1 ( Q n # , Q 0 ) . The L 2 ( P n ) -norm of L 1 ( Q n # * , Q 0 ) is equivalent to the L 2 ( P 0 ) -norm of L 1 ( Q n # * , Q 0 ) (see Lemma 3), which was just shown to be O P ( n − 1 / 4 ) . The L 2 ( P n ) -norm of L 1 ( Q n # , Q 0 ) can be bounded by the sum of the L 2 ( P n ) -norms of L 1 ( Q n # , Q n ) and L 1 ( Q n , Q 0 ) . These can be bounded in terms of d n 1 1 / 2 ( Q n # , Q n ) and d 01 1 / 2 ( Q n , Q 0 ) (using Lemma 3 again). Thus, the L 2 ( P n ) -norm of L 1 ( Q n # * , Q n # ) is O P ( n − 1 / 4 ) , so that we can establish again that ( P n − P n # ) L 1 ( Q n # * , Q n # ) = O P ( n − 1 / 2 − α / 4 ) .

Consider now the first empirical process term ( P n − P 0 ) L 1 ( Q n # * , Q n * ) . The L 2 ( P 0 ) -norm of L 1 ( Q n # * , Q n * ) can be bounded in terms of the L 2 ( P 0 ) -norms of L 1 ( Q n # * , Q 0 ) and L 1 ( Q n * , Q 0 ) , and is thus O P ( n − 1 / 4 ) as well. Therefore, it also follows that ( P n − P 0 ) L 1 ( Q n # * , Q n * ) = O P ( n − 1 / 2 − α / 4 ) . This proves that d 01 ( Q n # * , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) .
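As a concrete illustration of the bootstrap one-step TMLE update ϵ n # = arg min ϵ P n # L 1 ( Q n , ϵ # ) used in this lemma, the sketch below assumes squared error loss and a linear fluctuation Q n , ϵ # = Q n # + ϵ H with a user-supplied clever covariate H; these are illustrative choices rather than the article's general construction, and all arrays are hypothetical inputs. Under these assumptions ϵ n # has a closed form as a least-squares coefficient computed on the bootstrap sample.

import numpy as np

def one_step_tmle_update(Q_hash, H, Y, boot_idx):
    # Q_hash, H, Y: arrays of Q_n^#(X_i), H(X_i), Y_i over the original sample.
    # boot_idx: indices of the bootstrap resample.
    Qb, Hb, Yb = Q_hash[boot_idx], H[boot_idx], Y[boot_idx]
    # eps_n^# minimizes mean((Yb - Qb - eps*Hb)^2), i.e. the bootstrap empirical
    # risk along the linear fluctuation; closed-form least-squares solution:
    eps = np.sum(Hb * (Yb - Qb)) / np.sum(Hb ** 2)
    # Return Q_n^{#*} = Q^#_{n, eps_n^#}, evaluated at the original X_i.
    return Q_hash + eps * H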

D Proof of Theorem 4

Firstly, by definition of the remainder R 20 ( ⋅ ) we have the following two expansions:

Ψ ( Q n # * ) − Ψ ( Q 0 ) = ( P n # − P 0 ) D * ( Q n # * , G n # ) + R 20 ( Q n # * , G n # , Q 0 , G 0 ) = ( P n # − P n ) D * ( Q n # * , G n # ) + ( P n − P 0 ) D * ( Q n # * , G n # ) + R 20 ( Q n # * , G n # , Q 0 , G 0 ) , Ψ ( Q n * ) − Ψ ( Q 0 ) = ( P n − P 0 ) D * ( Q n * , G n ) + R 20 ( Q n * , G n , Q 0 , G 0 ) ,

where we ignored r n = P n D * ( Q n * , G n ) and its bootstrap analog r n # = P n # D * ( Q n # * , G n # ) (which were both assumed to be o P ( n − 1 / 2 ) ). Subtracting the second equality from the first yields:

(26) Ψ ( Q n # * ) − Ψ ( Q n * ) = ( P n # − P n ) D * ( Q n # * , G n # ) + ( P n − P 0 ) { D * ( Q n # * , G n # ) − D * ( Q n * , G n ) } + R 20 ( Q n # * , G n # , Q 0 , G 0 ) − R 20 ( Q n * , G n , Q 0 , G 0 ) .

Under the conditions of Theorem 1, we already established that R 20 ( Q n * , G n , Q 0 , G 0 ) = O P ( n − 1 / 2 − α / 4 ) . By assumption (8), we can bound the first remainder R 20 ( Q n # * , G n # , Q 0 , G 0 ) by f ( d 01 1 / 2 ( Q n # * , Q 0 ) , d 02 1 / 2 ( G n # , G 0 ) ) . Theorem 3 established that d 01 ( Q n # , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) and d 02 ( G n # , G 0 ) = O P ( n − 1 / 2 − α / 4 ) , and Lemma 4 established that d 01 ( Q n # * , Q 0 ) = O P ( n − 1 / 2 − α / 4 ) . Using the fact that f is a quadratic polynomial, this now also establishes that

R 20 ( Q n # * , G n # , Q 0 , G 0 ) = O P ( n − 1 / 2 − α / 4 ) .

It remains to analyze the two leading empirical process terms in (26). By our continuity assumption (9) on the efficient influence curve as a function of ( Q , G ) , we have that convergence of d 01 ( Q n # * , Q 0 ) + d 02 ( G n # , G 0 ) to zero implies convergence to zero of the square of the L 2 ( P 0 ) -norm of D * ( Q n # * , G n # ) − D * ( Q 0 , G 0 ) at the same rate in probability. Since we already established convergence of the L 2 ( P 0 ) -norm of D * ( Q n * , G n ) − D * ( Q 0 , G 0 ) to zero, this also establishes this result for the L 2 ( P 0 ) -norm of D * ( Q n # * , G n # ) − D * ( Q n * , G n ) . By Lemma 3 this also proves that the L 2 ( P n ) -norm of the latter converges to zero in probability. By empirical process theory [27] (as in the Appendix of [7]), this shows that ( P n # − P n ) D * ( Q n # * , G n # ) = ( P n # − P n ) D * ( Q n * , G n ) + O P ( n − 1 / 2 − α / 4 ) . This deals with the first leading term in (26).

By our continuity condition (9) we also have that

P 0 { D * ( Q n # * , G n # ) − D * ( Q n * , G n ) } 2 → p 0

at this rate. Again, by [27] this shows ( P n − P 0 ) { D * ( Q n # * , G n # ) − D * ( Q n * , G n ) } = O P ( n − 1 / 2 − α / 4 ) . Thus we have shown that

( P n # − P n ) D * ( Q n # * , G n # ) + ( P n − P 0 ) { D * ( Q n # * , G n # ) − D * ( Q n * , G n ) } = ( P n # − P n ) D * ( Q n * , G n ) + O P ( n − 1 / 2 − α / 4 ) .

Thus, we have now shown, conditional on ( P n : n ≥ 1 ) ,

n 1 / 2 ( Ψ ( Q n # * ) − Ψ ( Q n * ) ) = n 1 / 2 ( P n # − P n ) D * ( Q n * , G n ) + o P ( 1 ) → d N ( 0 , σ 0 2 ) .

This completes the proof of the Theorem for the HAL-TMLE. For a model M with extra structure (6), this gives the result for the HAL-TMLE at the fixed C u . However, it follows straightforwardly that this proof applies uniformly in any C between C 0 and C u , and thereby to a selector C n satisfying (16).
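In practice, this conditional convergence is what licenses using the bootstrap draws of Ψ ( Q n # * ) − Ψ ( Q n * ) in place of the unknown N ( 0 , σ 0 2 ) limit. The sketch below shows one way this could be wired up, assuming a hypothetical routine fit_hal_tmle(data, C) that returns the HAL-TMLE plug-in estimate at a fixed sectional variation norm bound C, and a data set stored as a numpy array of observations; it is a schematic illustration rather than the article's implementation.

import numpy as np

def bootstrap_ci(data, fit_hal_tmle, C, n_boot=1000, level=0.95, rng=None):
    # Nonparametric bootstrap CI for psi_n = Psi(Q_n^*), at a fixed sectional
    # variation norm bound C (taken at least as large as the cross-validation
    # selector, per the theorem). fit_hal_tmle is a hypothetical user routine.
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    psi_n = fit_hal_tmle(data, C)                 # Psi(Q_n^*) on the original sample
    psi_boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)               # resample O_1,...,O_n with replacement
        psi_boot[b] = fit_hal_tmle(data[idx], C)  # Psi(Q_n^{#*}) on the bootstrap sample
    # Quantiles of Psi(Q_n^{#*}) - Psi(Q_n^*) approximate the sampling distribution
    # of Psi(Q_n^*) - Psi(Q_0); invert them for the confidence interval.
    alpha = 1 - level
    lo, hi = np.quantile(psi_boot - psi_n, [alpha / 2, 1 - alpha / 2])
    return psi_n - hi, psi_n - lo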

E Understanding why d n 1 ( Q n # , Q n ) is a quadratic dissimilarity

Lemma 5

Assume the extra model structure (6) on M . Let P n R 2 L 1 , n ( Q n # , Q n ) be defined as the exact second-order remainder of a first-order Taylor expansion of P n L 1 ( Q ) at Q n :

P n { L 1 ( Q n # ) − L 1 ( Q n ) } = P n d d Q n L 1 ( Q n ) ( Q n # − Q n ) + P n R 2 L 1 , n ( Q n # , Q n ) ,

where d d Q n L 1 ( Q n ) ( h ) = d d ϵ L 1 ( Q n + ϵ h ) | ϵ = 0 is the directional derivative in direction h.

We have P n d d Q n L 1 ( Q n ) ( Q n # − Q n ) ≥ 0 so that

d n 1 ( Q n # , Q n ) ≥ P n R 2 L 1 , n ( Q n # , Q n ) .

In order to give the reader a concrete example of what this empirical dissimilarity d n 1 ( Q n # , Q n ) looks like, we present here the corollary of Lemma 5 for the squared error loss.

Corollary 1 Consider the definitions of Lemma 5 and apply them to the squared error loss function L 1 ( Q ) ( O ) = ( Y − Q ( X ) ) 2 . Then, P n R 2 L 1 , n ( Q n # , Q n ) = P n ( Q n # − Q n ) 2 , so that we have

d n 1 ( Q n # , Q n ) ≥ P n ( Q n # − Q n ) 2 .

Since P n { L 1 ( Q n # ) − L 1 ( Q n ) } 2 = O P ( P n ( Q n # − Q n ) 2 ) , this implies P n { L 1 ( Q n # ) − L 1 ( Q n ) } 2 = O P ( d n 1 ( Q n # , Q n ) ) .

Proof of Corollary: We will prove P n R 2 L 1 , n ( Q n # , Q n ) = P n ( Q n # − Q n ) 2 . The remaining statement is then just an immediate corollary of Lemma 5. We have

d n 1 ( Q n # , Q n ) = 1 n ∑ i { 2 Y i Q n ( X i ) − 2 Y i Q n # ( X i ) + Q n # 2 ( X i ) − Q n 2 ( X i ) } = 1 n ∑ i { 2 ( Q n − Q n # ) ( X i ) Y i + Q n # 2 ( X i ) − Q n 2 ( X i ) } = 1 n ∑ i { 2 ( Q n − Q n # ) ( X i ) ( Y i − Q n ( X i ) ) + 2 ( Q n − Q n # ) Q n ( X i ) + Q n # 2 ( X i ) − Q n 2 ( X i ) } = 1 n ∑ i 2 ( Q n − Q n # ) ( X i ) ( Y i − Q n ( X i ) ) + 1 n ∑ i ( Q n − Q n # ) 2 ( X i ) .

Note that the first term corresponds with P n d d Q n L 1 ( Q n ) ( Q n # − Q n ) and the second-order term with P n R 2 L 1 , n ( Q n # , Q n ) , where R 2 L 1 , n ( Q n # , Q n ) = ( Q n # − Q n ) 2 .
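The exact decomposition derived in this corollary is easy to check numerically. The snippet below (with arbitrary simulated values standing in for Y i , Q n ( X i ) and Q n # ( X i ) , not an actual HAL fit) verifies that P n { L 1 ( Q n # ) − L 1 ( Q n ) } equals the linear term plus P n ( Q n # − Q n ) 2 up to floating-point error.

import numpy as np

rng = np.random.default_rng(1)
n = 50
Y = rng.normal(size=n)
Qn = rng.normal(size=n)   # Q_n(X_i), arbitrary stand-in values
Qb = rng.normal(size=n)   # Q_n^#(X_i), arbitrary stand-in values

d_n1 = np.mean((Y - Qb) ** 2 - (Y - Qn) ** 2)    # P_n{L1(Q_n^#) - L1(Q_n)}
linear = np.mean(-2 * (Y - Qn) * (Qb - Qn))      # P_n d/dQ_n L1(Q_n)(Q_n^# - Q_n)
quad = np.mean((Qb - Qn) ** 2)                   # P_n R_{2,L1,n}(Q_n^#, Q_n)
assert np.isclose(d_n1, linear + quad)           # exact second-order expansion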

Proof of Lemma 5: We need to prove that the linear approximation satisfies

P n d d Q n L 1 ( Q n ) ( Q n # − Q n ) ≥ 0 .

The extra model structure (6) allows the explicit calculation of score equations for the HAL-MLE and its bootstrap analog, which then provides us with the desired inequality.

Consider the h-specific path

Q n , ϵ h ( x ) = ( 1 + ϵ h ( 0 ) ) Q n ( 0 ) + ∑ s ∫ ( 0 s , x s ] ( 1 + ϵ h s ( u s ) ) d Q n , s ( u s )

for ϵ ∈ [ 0 , δ ) for some δ > 0 , where h is uniformly bounded, and, if C l < C u ,

r ( h , Q n ) ≡ h ( 0 ) | Q n ( 0 ) | + ∑ s ∫ ( 0 s , τ s ] h s ( u s ) | d Q n , s ( u s ) | ≤ 0 ,

while if C l = C u , then r ( h , Q n ) = 0 . Let H = { h : r ( h , Q n ) ≤ 0 , | | h | | ∞ < ∞ } be the set of possible functions h (i.e., functions of ( s , u s ) ), which defines a collection of paths { Q n , ϵ h : ϵ } indexed by h ∈ H . Consider a given h ∈ H and let’s denote this path with Q n , ϵ , suppressing the dependence on h in the notation. For ϵ ≥ 0 small enough we have ( 1 + ϵ h ( 0 ) ) > 0 and 1 + ϵ h s ( u s ) > 0 . Thus, for ϵ ≥ 0 small enough we have

| | Q n , ϵ | | v * = ( 1 + ϵ h ( 0 ) ) | Q n ( 0 ) | + ∑ s ∫ ( 0 s , τ s ] ( 1 + ϵ h s ( u s ) ) | d Q n , s ( u s ) | = | | Q n | | v * + ϵ { h ( 0 ) | Q n ( 0 ) | + ∑ s ∫ ( 0 s , τ s ] h s ( u s ) | d Q n , s ( u s ) | } = | | Q n | | v * + ϵ r ( h , Q n ) ≤ | | Q n | | v * ,

by assumption that r ( h , Q n ) ≤ 0 . If C l = C u and thus r ( h , Q n ) = 0 , then the above shows | | Q n , ϵ | | v * = | | Q n | | v * . Thus, for a small enough δ > 0 , { Q n , ϵ : 0 ≤ ϵ < δ } represents a path of cadlag functions with sectional variation norm bounded from below and above: C l ≤ | | Q n , ϵ | | v * ≤ C u . In addition, we have that d Q n , s ( u s ) = 0 implies ( 1 + ϵ h s ( u s ) ) d Q n , s ( u s ) = 0 so that the support of Q n , ϵ is included in the support A of Q n as defined by F A n p . Thus, this proves that for δ > 0 small enough this path { Q n , ϵ : 0 ≤ ϵ ≤ δ } is indeed a submodel of the parameter space of Q , defined as F A n p or F A n p + .

We also have that

Q n , ϵ − Q n = ϵ { Q n ( 0 ) h ( 0 ) + ∑ s ∫ ( 0 s , x s ] h s ( u s ) d Q n , s ( u s ) } .

Thus, this path generates a direction f ( h , Q n ) at ϵ = 0 given by:

d d ϵ Q n , ϵ = f ( h , Q n ) ≡ Q n ( 0 ) h ( 0 ) + ∑ s ∫ ( 0 s , x s ] h s ( u s ) d Q n , s ( u s ) .

Let S ≡ { f ( h , Q n ) : h ∈ H } be the collection of directions generated by our family of paths. By definition of the MLE Q n , we also have that ϵ ↦ P n L 1 ( Q n , ϵ ) is minimal over [ 0 , δ ) at ϵ = 0 . This shows that the derivative of P n L 1 ( Q n , ϵ ) from the right at ϵ = 0 is non-negative:

d d ϵ + P n L 1 ( Q n , ϵ ) ≥ 0  at  ϵ = 0 .

This derivative is given by P n d d Q n L 1 ( Q n ) ( f ( h , Q n ) ) , where d / d Q n L 1 ( Q n ) ( f ( h , Q n ) ) is the directional (Gateaux) derivative of Q ↦ L 1 ( Q ) at Q n in the direction f ( h , Q n ) . Thus, for each h ∈ H , we have

P n d d Q n L 1 ( Q n ) ( f ( h , Q n ) ) ≥ 0 .

Suppose that

(27) Q n # − Q n ∈ S = { f ( h , Q n ) : h ∈ H } .

Then, we have

P n d d Q n L 1 ( Q n ) ( Q n # − Q n ) ≥ 0 .

Combined with the stated second-order Taylor expansion of P n L 1 ( Q ) at Q = Q n with exact second-order remainder P n R 2 L 1 , n ( Q n # , Q n ) , this proves

P n { L 1 ( Q n # ) − L 1 ( Q n ) } ≥ P n R 2 L 1 , n ( Q n # , Q n ) .

Thus it remains to show (27).

In order to prove (27), let’s solve explicitly for h so that Q n # − Q n = f ( h , Q n ) and then verify that h ∈ H satisfies its assumed constraints (i.e., r ( h , Q n ) ≤ 0 if C l < C u or r ( h , Q n ) = 0 if C l = C u , and h is uniformly bounded). We have

Q n # − Q n = Q n # ( 0 ) − Q n ( 0 ) + ∑ s ∫ ( 0 s , x s ] ( d Q n , s # − d Q n , s ) ( u s ) = Q n # ( 0 ) − Q n ( 0 ) + ∑ s ∫ ( 0 s , x s ] ( d Q n , s # − d Q n , s ) / d Q n , s ( u s ) d Q n , s ( u s ) ,

where we used that Q n , s # ≪ Q n , s for each subset s . Let h ( Q n # , Q n ) be defined by

h ( Q n # , Q n ) ( 0 ) = ( Q n # ( 0 ) − Q n ( 0 ) ) / Q n ( 0 ) , h s ( Q n # , Q n ) = ( d Q n , s # − d Q n , s ) / d Q n , s  for all subsets s .

For this choice h ( Q n # , Q n ) , we have f ( h , Q n ) = Q n # − Q n . First, consider the case Q ( M ) = F A n p or Q ( M ) = F A n p + , but C l < C u . We now need to verify that r ( h , Q n ) ≤ 0 for this choice h = h ( Q n # , Q n ) . We have

r ( h , Q n ) = ( Q n # ( 0 ) − Q n ( 0 ) ) / Q n ( 0 ) | Q n ( 0 ) | + ∑ s ∫ ( 0 s , τ s ] ( d Q n , s # − d Q n , s ) / d Q n , s | d Q n , s | = I ( Q n ( 0 ) > 0 ) { Q n # ( 0 ) − Q n ( 0 ) } + I ( Q n ( 0 ) ≤ 0 ) { Q n ( 0 ) − Q n # ( 0 ) } + ∑ s ∫ ( 0 s , τ s ] I ( d Q n , s ≥ 0 ) d ( Q n , s # − Q n , s ) + ∑ s ∫ ( 0 s , τ s ] I ( d Q n , s < 0 ) d ( Q n , s − Q n , s # ) = − | | Q n | | v * + Q n # ( 0 ) { I ( Q n ( 0 ) > 0 ) − I ( Q n ( 0 ) ≤ 0 ) } + ∑ s ∫ ( 0 s , τ s ] { I ( d Q n , s ≥ 0 ) − I ( d Q n , s < 0 ) } d Q n , s # ≤ − | | Q n | | v * + | Q n # ( 0 ) | + ∑ s ∫ ( 0 s , τ s ] | d Q n , s # ( u s ) | = − | | Q n | | v * + | | Q n # | | v * ≤ 0 ,

since | | Q n # | | v * ≤ | | Q n | | v * , by assumption. Thus, this proves that indeed r ( h , Q n ) ≤ 0 and thus that Q n # − Q n ∈ S . Consider now the case that Q ( M ) = F A n p + and C l = C u . Then | | Q n | | v * = | | Q n # | | v * = C u . We now need to show that r ( h , Q n ) = 0 for this choice h = h ( Q n # , Q n ) . We now use the same three equalities as above, but now use that d Q n , s ( u s ) ≥ 0 and Q n ( 0 ) ≥ 0 , by definition of F A n p + , which then shows r ( h , Q n ) = 0 .

This proves (27) and thereby completes the proof of Lemma 5.
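For a one-dimensional piecewise-constant Q n , the path used in this proof and the identity | | Q n , ϵ | | v * = | | Q n | | v * + ϵ r ( h , Q n ) can be checked directly. In the sketch below (illustrative numbers only, not taken from the article), Q n is represented by its value at 0 and its jumps d j at knots t j , so that | | Q n | | v * = | Q n ( 0 ) | + ∑ j | d j | , and the path rescales the value at 0 by ( 1 + ϵ h ( 0 ) ) and each jump by ( 1 + ϵ h j ) .

import numpy as np

# A piecewise-constant Q_n on [0, 1]: value at 0 plus jumps d_j at knots t_j.
Q0 = 0.5                                   # Q_n(0)
jumps = np.array([0.8, -0.3, 0.4])         # dQ_{n}(t_j), illustrative values
h0, h = -0.2, np.array([0.1, -0.5, 0.3])   # a bounded direction h with r(h, Q_n) <= 0

vnorm = abs(Q0) + np.sum(np.abs(jumps))    # ||Q_n||_v*
r = h0 * abs(Q0) + np.sum(h * np.abs(jumps))   # r(h, Q_n), here equal to -0.05 <= 0

eps = 0.05                                 # small enough that 1 + eps*h stays positive
Q0_eps = (1 + eps * h0) * Q0
jumps_eps = (1 + eps * h) * jumps
vnorm_eps = abs(Q0_eps) + np.sum(np.abs(jumps_eps))

# ||Q_{n,eps}||_v* = ||Q_n||_v* + eps * r(h, Q_n) <= ||Q_n||_v*.
assert np.isclose(vnorm_eps, vnorm + eps * r)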

F Number of non-zero HAL coefficients as a function of sample size

Figure 6: The number of non-zero coefficients in Q n as a function of sample size (under simulation 1 at a 1 = 0.5 ).

G Extending the class F to allow shifts by an unbounded constant

The result and proof below were provided to us by the reviewer.

Lemma 6

Let F be a class of functions for which there exists some M > 0 such that | | f | | ∞ ≤ M for all f ∈ F . If G : = { x ↦ a + f : a ∈ [ − M , M ] , f ∈ F } , then, for all ϵ ∈ ( 0 , 1 ] ,

sup Q log N ( 2 ϵ , G , L 2 ( Q ) ) ≤ sup Q log N ( ϵ , F , L 2 ( Q ) ) − log ( ϵ ) + log ( 2 M + 1 ) .

Proof. Fix Q and let f 1 , … , f N be an ϵ-covering of F with respect to the L 2 ( Q ) pseudometric. Also let a 1 , … , a 2 M / ϵ denote an ϵ-cover of [ − M , M ] . Fix g ∈ G . We know that there exists an a ∈ R and f ∈ F such that g ( ⋅ ) = a + f ( ⋅ ) . We also know that there exists a j ( f ) ∈ { 1 , … , 2 M / ϵ } and a k ( f ) ∈ { 1 , … , N } such that | a − a j ( f ) | ≤ ϵ and | | f − f k ( f ) | | L 2 ( Q ) ≤ ϵ . Hence, | | g − a j ( f ) − f k ( f ) | | L 2 ( Q ) ≤ 2 ϵ . Consequently, { a j + f k : j ∈ { 1 , … , 2 M / ϵ } , k ∈ { 1 , … , N } } is a 2 ϵ -cover of G with respect to the L 2 ( Q ) pseudometric. Hence,

(28) log N ( 2 ϵ , G , L 2 ( Q ) ) ≤ log N ( ϵ , F , L 2 ( Q ) ) + log ( 2 M / ϵ ) ≤ log N ( ϵ , F , L 2 ( Q ) ) + log ( 2 M / ϵ + 1 ) ≤ log N ( ϵ , F , L 2 ( Q ) ) + log ( ( 2 M + 1 ) / ϵ ) .

Taking a supremum in Q on both sides completes the proof.

Lemma 7

Consider the setting of Lemma 6. Fix δ > 0 , and let

G δ : = { g 1 − g 2 : g 1 , g 2 ∈ G , | | g 1 − g 2 | | L 2 ( P ) < δ } , H δ : = { a 1 + f 1 − a 2 − f 2 : f 1 , f 2 ∈ F , a 1 , a 2 ∈ R , | | a 1 + f 1 − a 2 − f 2 | | L 2 ( P ) < δ } .

It holds that

(29) sup h ∈ H δ | ( P n − P ) h | = sup g ∈ G δ | ( P n − P ) g | .

Proof. Clearly G δ ⊂ H δ , and so

sup h ∈ H δ | ( P n − P ) h | ≥ sup g ∈ G δ | ( P n − P ) g | .

We now show the other direction. Fix h ∈ H δ . We know that there exist a 1 , a 2 ∈ R and f 1 , f 2 ∈ F such that h ( ⋅ ) = a 1 + f 1 ( ⋅ ) − a 2 − f 2 ( ⋅ ) . Moreover, it holds that

(30) ( P n − P ) ( a 1 + f 1 − a 2 − f 2 ) = ( P n − P ) ( f 1 − P f 1 − f 2 + P f 2 ) .

By the uniform bound on F , P f 1 , P f 2 ∈ [ − M , M ] . Hence, f 1 ( ⋅ ) − P f 1 and f 2 ( ⋅ ) − P f 2 belong to G . Moreover, because the variance is upper bounded by the raw second moment,

| | f 1 − P f 1 − f 2 + P f 2 | | L 2 ( P ) ≤ | | a 1 + f 1 − a 2 − f 2 | | L 2 ( P ) .

Now, as h ∈ H δ , the right-hand side is upper bounded by δ. Hence, f 1 − P f 1 − f 2 + P f 2 belongs to G δ . Consequently, taking an absolute value of both sides of (30) and then a supremum over g ∈ G δ on the right yields that

(31) | ( P n − P ) ( a 1 + f 1 − a 2 − f 2 ) | = | ( P n − P ) ( f 1 − P f 1 − f 2 + P f 2 ) | ≤ sup g ∈ G δ | ( P n − P ) g | .

As h = a 1 + f 1 − a 2 − f 2 was an arbitrary element of H δ , the proof is complete.

Received: 2017-08-30
Accepted: 2020-04-14
Published Online: 2020-08-10

© 2020 Walter de Gruyter GmbH, Berlin/Boston
