Abstract
Parametric high-dimensional regression requires regularization terms in order to obtain interpretable models. The resulting estimators correspond to regularized M-functionals, which are inherently highly nonlinear. Their Gâteaux derivative, i.e., their influence curve, linearizes the asymptotic bias of the estimator, but only up to a remainder term that, without deeper arguments, is not guaranteed to tend (sufficiently fast) to zero uniformly on suitable tangent sets. We fill this gap by studying, in a unified framework, under which conditions the M-functionals corresponding to convex penalties as regularization are compactly differentiable, so that the estimators admit an asymptotically linear expansion. This key ingredient allows influence curves to reasonably enter model diagnosis and enables a fast, valid update formula that requires only an evaluation of the corresponding influence curve at new data points. Moreover, it paves the way for optimally robust estimators, obtained by bounding the influence curves in a suitable way.
References
Alfons, A., Croux, C., Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248.
Alqallaf, F., Van Aelst, S., Yohai, V. J., Zamar, R. H. (2009). Propagation of outliers in multivariate data. The Annals of Statistics, 37(1), 311–331.
Aravkin, A. Y., Burke, J. V., Pillonetto, G. (2013). Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. The Journal of Machine Learning Research, 14(1), 2689–2728.
Avella-Medina, M. (2017). Influence functions for penalized M-estimators. Bernoulli, 23(4B), 3178–3196.
Averbukh, V., Smolyanov, O. (1967). The theory of differentiation in linear topological spaces. Russian Mathematical Surveys, 22(6), 201–258.
Banerjee, O., Ghaoui, L. E., d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Berge, C. (1963). Topological Spaces: Including a treatment of multi-valued functions, vector spaces, and convexity. Courier Corporation.
Beutner, E., Zähle, H. (2010). A modified functional delta method and its application to the estimation of risk functionals. Journal of Multivariate Analysis, 101(10), 2452–2463.
Beutner, E., Zähle, H. (2016). Functional delta-method for the bootstrap of quasi-Hadamard differentiable functionals. Electronic Journal of Statistics, 10(1), 1181–1222.
Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34(2), 559–583.
Bühlmann, P., Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Bühlmann, P., Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media.
Chang, L., Roberts, S., Welsh, A. (2018). Robust lasso regression using Tukey’s biweight criterion. Technometrics, 60(1), 36–47.
Chen, X., Wang, Z. J., McKeown, M. J. (2010a). Asymptotic analysis of robust lassos in the presence of noise with large variance. IEEE Transactions on Information Theory, 56(10), 5131–5149.
Chen, X., Wang, Z. J., McKeown, M. J. (2010b). Asymptotic analysis of the Huberized lasso estimator. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1898–1901. IEEE.
Christmann, A., Steinwart, I. (2004). On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007–1034.
Christmann, A., Van Messem, A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9(May), 915–936.
Christmann, A., Van Messem, A., Steinwart, I. (2009). On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2(3), 311–327.
Clarke, B. R. (1983). Uniqueness and Fréchet differentiability of functional solutions to maximum likelihood type equations. The Annals of Statistics, 11(4), 1196–1205.
Clémençon, S., Lugosi, G., Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2), 844–874.
Clémençon, S., Depecker, M., Vayatis, N. (2013). An empirical comparison of learning algorithms for nonparametric scoring: The TreeRank algorithm and other methods. Pattern Analysis and Applications, 16(4), 475–496.
De los Reyes, J. C., Schönlieb, C.-B., Valkonen, T. (2016). The structure of optimal parameters for image restoration problems. Journal of Mathematical Analysis and Applications, 434(1), 464–500.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Evgrafov, A., Patriksson, M. (2003). Stochastic structural topology optimization: discretization and penalty function approach. Structural and Multidisciplinary Optimization, 25(3), 174–188.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fernholz, L. (1983). Von Mises calculus for statistical functionals. Lecture Notes in Statistics, Vol. 19. Springer.
Fraiman, R., Yohai, V. J., Zamar, R. H. (2001). Optimal robust M-estimates of location. The Annals of Statistics, 29(1), 194–223.
Fréchet, M. (1937). Sur la notion de différentielle dans l’analyse générale.
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
Friedman, J., Hastie, T., Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
Gill, R. D., Wellner, J. A., Præstgaard, J. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part 1) [with discussion and reply]. Scandinavian Journal of Statistics, 16(2), 97–128.
Hable, R. (2012). Asymptotic normality of support vector machine variants and other regularized kernel methods. Journal of Multivariate Analysis, 106, 92–117.
Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.
Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W. (2011). Robust statistics: The approach based on influence functions, (Vol. 114). John Wiley & Sons.
Huber, P. J., Ronchetti, E. (2009). Robust Statistics. Wiley.
Jain, N., Marcus, M. (1975). Central limit theorems for C(S)-valued random variables. Journal of Functional Analysis, 19(3), 216–231.
Kohl, M. (2005). Numerical contributions to the asymptotic theory of robustness. PhD thesis, University of Bayreuth.
Krätschmer, V., Schied, A., Zähle, H. (2012). Qualitative and infinitesimal robustness of tail-dependent statistical functionals. Journal of Multivariate Analysis, 103(1), 35–47.
Lambert-Lacroix, S., Zwald, L. (2011). Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5, 1015–1053.
LeCam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics, 41(3), 802–828.
LeDell, E., Petersen, M., van der Laan, M. (2015). Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electronic Journal of Statistics, 9(1), 1583.
Lee, S. (2015). An additive sparse penalty for variable selection in high-dimensional linear regression model. Communications for Statistical Applications and Methods, 22(2), 147–157.
Levitin, E., Tichatschke, R. (1998). On smoothing of parametric minimax-functions and generalized max-functions via regularization. Journal of Convex Analysis, 5, 199–220.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2), 866–896.
Loh, P.-L., Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16, 559–616.
Maronna, R., Martin, R., Yohai, V. (2006). Robust statistics: Theory and methods. Wiley.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Öllerer, V., Croux, C., Alfons, A. (2015). The influence function of penalized regression estimators. Statistics, 49(4), 741–765.
Osborne, M. R., Presnell, B., Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.
Pötscher, B. M., Leeb, H. (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2065–2082.
Pötscher, B. M., Schneider, U. (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference, 139(8), 2775–2790.
Pupashenko, D., Ruckdeschel, P., Kohl, M. (2015). L2 differentiability of generalized linear models. Statistics & Probability Letters, 97(C), 155–164.
Reeds, J. (1976). On the definition of von Mises functionals. PhD thesis, Harvard University.
Rieder, H. (1994). Robust asymptotic statistics (Vol. 1). Springer Science & Business Media.
Rieder, H., Kohl, M., Ruckdeschel, P. (2008). The cost of not knowing the radius. Statistical Methods & Applications, 17(1), 13–40.
Rieder, H., Ruckdeschel, P. (2001). Short proofs on \({L}_r\)-differentiability. Statistics & Risk Modeling, 19(4), 419–426.
Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86.
Van de Geer, S. (2016). Estimation and testing under sparsity. Springer.
Van der Vaart, A. (2000). Asymptotic statistics, (Vol. 3). Cambridge University Press.
Van der Vaart, A., Wellner, J. (2013). Weak convergence and empirical processes: With applications to statistics. Springer Science & Business Media.
Vapnik, V. (1998). Statistical learning theory (Vol. 1). New York: Wiley.
De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., Verri, A. (2004). Some properties of regularized kernel methods. Journal of Machine Learning Research, 5, 1363–1390.
Von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3), 309–348.
Wellner, J. A. (1992). Empirical processes in action: A review. International Statistical Review/Revue Internationale de Statistique, 247–269.
Werner, D. (2006). Funktionalanalysis. Springer.
Werner, T. (2019). Gradient-Free Gradient Boosting. PhD thesis, Carl von Ossietzky Universität Oldenburg.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
The results presented in this paper are part of the author’s PhD thesis (Werner 2019), supervised by P. Ruckdeschel at Carl von Ossietzky University Oldenburg. I thank an anonymous referee for valuable comments that helped improve the quality of the paper. I also thank Prof. T. Dickhaus for calling my attention to the work of Pötscher and Leeb.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Miscellaneous
The concept of \(L_2\)-differentiability originally goes back to LeCam (1970).
Definition 5
Let \(\mathcal {P}:=\{P_{\theta } \ | \ \theta \in \varTheta \}\) be a family of probability measures on some measurable space \((\varOmega , \mathcal {A})\) and let \(\varTheta\) be a subset of \(\mathbb {R}^p\). Then, \(\mathcal {P}\) is \(L_2\)-differentiable at \(\theta _0\) if there exists \(\varLambda _{\theta _0} \in L_2^p(P_{\theta _0})\) such that
\[ \left\Vert \sqrt{dP_{\theta _0+h}}-\sqrt{dP_{\theta _0}}\Bigl (1+\frac{1}{2}h^T\varLambda _{\theta _0}\Bigr )\right\Vert _{L_2}=o(||h||) \]
for \(||h|| \rightarrow 0\). In this case, the function \(\varLambda _{\theta _0}\) is the \(L_2\)-derivative and \(I_{\theta _0}:=\mathbb {E}_{\theta _0}[\varLambda _{\theta _0}\varLambda _{\theta _0}^T]\) is the Fisher information of \(\mathcal {P}\) at \(\theta _0\).
Note that \(L_2\)-differentiability is a special case of the wider concept of \(L_r\)-differentiability (cf. Rieder and Ruckdeschel 2001). The \(L_2\)-differentiability holds for many distribution families, including normal location and scale families, Poisson families, gamma families, and even for ARMA, ARCH and GPD families (Rieder et al. 2008; Pupashenko et al. 2015). A standard example of a distribution family that is not \(L_2\)-differentiable is the model \(\mathcal {P}:=\{U([0,\theta ]) \ | \ \theta \in \varTheta \}\).
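As a standard textbook illustration (not one of the models treated in this paper): for the normal location family \(\{N(\theta ,\sigma ^2) \ | \ \theta \in \mathbb {R}\}\) with known \(\sigma ^2>0\), the \(L_2\)-derivative coincides with the classical score function,

```latex
\varLambda_{\theta_0}(x)
  = \partial_{\theta}\log\varphi_{\theta,\sigma^2}(x)\Big|_{\theta=\theta_0}
  = \frac{x-\theta_0}{\sigma^2},
\qquad
I_{\theta_0}
  = \mathbb{E}_{\theta_0}\bigl[\varLambda_{\theta_0}^2\bigr]
  = \frac{1}{\sigma^2},
```

so the Fisher information reduces to the familiar \(1/\sigma ^2\).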
The following definition of partial influence curves and the corresponding asymptotically linear expansion in terms of such partial influence curves is borrowed from Rieder (1994, Def. 4.2.10) and Rieder et al. (2008).
Definition 6
Let \((\varOmega ^n, \mathcal {A}^n)\) be a measurable space and let \(S_n: (\varOmega ^n, \mathcal {A}^n) \rightarrow (\mathbb {R}^q, \mathbb {B}^q)\) be an estimator for the transformed quantity of interest \(\tau (\theta )\). Assume that \(\tau : \varTheta \rightarrow \mathbb {R}^q\) is differentiable at \(\theta _0 \in \varTheta\) where \(\varTheta \subset \mathbb {R}^p\) and \(q \le p\). Denote the Jacobian by \(\partial _{\theta _0} \tau =:D_{\theta _0} \in \mathbb {R}^{q \times p}\). Then, the set of partial influence curves is defined by
\[ \varPsi _2^D(\theta _0):=\bigl \{\eta _{\theta _0} \in L_2^q(P_{\theta _0}) \ | \ \mathbb {E}_{\theta _0}[\eta _{\theta _0}]=0, \ \mathbb {E}_{\theta _0}[\eta _{\theta _0}\varLambda _{\theta _0}^T]=D_{\theta _0}\bigr \}. \]
The sequence \((S_n)_n\) is asymptotically linear at \(P_{\theta _0}\) if there exists a partial influence curve \(\eta _{\theta _0} \in \varPsi _2^D(\theta _0)\) such that the expansion
\[ S_n=\tau (\theta _0)+\frac{1}{n}\sum _{i=1}^n \eta _{\theta _0}(X_i)+o_{P_{\theta _0}^n}(n^{-1/2}) \]
is valid.
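A minimal numerical sketch of the expansion, using my own example rather than anything from the paper: in the normal location model with \(\tau =\mathrm{id}\), the sample mean is asymptotically linear with influence curve \(\eta _{\theta _0}(x)=x-\theta _0\); for this particular estimator the expansion even holds exactly, with vanishing remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 1.5
x = rng.normal(loc=theta0, scale=1.0, size=1000)

S_n = x.mean()                   # the estimator: sample mean
eta = x - theta0                 # influence curve values eta(x_i) = x_i - theta0
expansion = theta0 + eta.mean()  # tau(theta0) + (1/n) sum_i eta(x_i)

# For the sample mean the remainder term is zero up to floating point:
remainder = S_n - expansion
print(remainder)
```

For nonlinear estimators such as regularized M-estimators, the remainder is of course nonzero for finite n; the point of the paper is precisely that it is \(o_{P_{\theta _0}^n}(n^{-1/2})\) under compact differentiability.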
For the following lemma, we refer to Evgrafov and Patriksson (2003) and Levitin and Tichatschke (1998).
Lemma 4
Let \(f: \mathcal {X} \times \mathcal {Y} \times \varTheta \rightarrow \mathbb {R}\) be continuous, where \(\mathcal {X} \subset \mathbb {R}^n\), \(\mathcal {Y} \subset \mathbb {R}^m\), \(\varTheta \subset \mathbb {R}^k\). Define \(\varXi (x,y):=\mathrm{argmin}_{\theta }(f(x,y,\theta ))\). If f is coercive w.r.t. \(\theta\), i.e., if the level sets \(\{\theta \in \varTheta \ | \ f(x,y,\theta ) \le c\}\) are bounded for all \(c \in \mathbb {R}\) and every \(x \in \mathcal {X}\), \(y \in \mathcal {Y}\), then \(\min _{\theta }(f(x,y,\theta ))>-\infty\) and \(\varXi (x,y)\) is nonempty and compact for any x and y.
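A toy illustration of the lemma in the spirit of the paper, with names and numbers of my own choosing: the lasso-type objective \(f(x,y,\theta )=(y-\theta x)^2+\lambda |\theta |\) is continuous and coercive in \(\theta\) whenever \(x \ne 0\), so its argmin is nonempty and compact; here it is even a singleton, given by soft-thresholding.

```python
import numpy as np

def f(x, y, theta, lam):
    """Lasso-type objective: coercive in theta for x != 0."""
    return (y - theta * x) ** 2 + lam * abs(theta)

def argmin_closed_form(x, y, lam):
    """Unique minimizer via soft-thresholding of x*y at lam/2."""
    z = x * y
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / x ** 2

x, y, lam = 2.0, 3.0, 1.0
theta_star = argmin_closed_form(x, y, lam)   # (|6| - 0.5) / 4 = 1.375

# Cross-check against a brute-force grid search over a compact interval:
theta_grid = np.linspace(-5, 5, 200001)
theta_grid_min = theta_grid[np.argmin(f(x, y, theta_grid, lam))]
print(theta_star, theta_grid_min)
```

The coercivity is what licenses restricting the grid search to a bounded interval in the first place: outside a sufficiently large compact set, f exceeds any fixed level.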
About this article
Cite this article
Werner, T. Asymptotic linear expansion of regularized M-estimators. Ann Inst Stat Math 74, 167–194 (2022). https://doi.org/10.1007/s10463-021-00792-5