
Asymptotic linear expansion of regularized M-estimators

Annals of the Institute of Statistical Mathematics

Abstract

Parametric high-dimensional regression requires regularization terms to obtain interpretable models. The resulting estimators correspond to regularized M-functionals, which are inherently highly nonlinear. Their Gâteaux derivative, i.e., their influence curve, linearizes the asymptotic bias of the estimator only up to a remainder term which, without deeper arguments, is not guaranteed to tend to zero (sufficiently fast) uniformly on suitable tangent sets. We fill this gap by studying, in a unified framework, under which conditions the M-functionals corresponding to convex penalties as regularization are compactly differentiable, so that the estimators admit an asymptotically linear expansion. This key ingredient allows influence curves to reasonably enter model diagnosis and enables a fast, valid update formula that only requires evaluating the corresponding influence curve at new data points. Moreover, it paves the way for optimally robust estimators, obtained by bounding the influence curves in a suitable way.
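To make the update formula concrete, the following minimal sketch (our illustration, not code from the paper; the sample-mean setting and all function names are assumptions) updates an asymptotically linear estimator with one new observation via a single influence-curve evaluation, \(S_{n+1} \approx S_n + \eta _{S_n}(x_{n+1})/(n+1)\). For the sample mean, whose influence curve is \(\eta _{\theta }(x)=x-\theta\), this update reproduces the running mean exactly.

```python
import numpy as np

def influence_curve_mean(theta, x):
    # Influence curve of the sample mean at location theta: eta(x) = x - theta.
    return x - theta

def one_step_update(estimate, n, x_new, influence_curve):
    # One-step update of an asymptotically linear estimator fitted on n points:
    # a single influence-curve evaluation replaces a full refit.
    return estimate + influence_curve(estimate, x_new) / (n + 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)

est = x[0]                  # estimate based on the first observation
for n in range(1, len(x)):  # est is based on the first n observations
    est = one_step_update(est, n, x[n], influence_curve_mean)

# For the mean, the update is exact: est equals x.mean() up to floating point.
print(est, x.mean())
```

For a general regularized M-estimator, the same update is valid only up to the remainder of the asymptotically linear expansion, which is exactly what the compact differentiability results of the paper control.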


References

  • Alfons, A., Croux, C., Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248.

  • Alqallaf, F., Van Aelst, S., Yohai, V. J., Zamar, R. H. (2009). Propagation of outliers in multivariate data. The Annals of Statistics, 37(1), 311–331.

  • Aravkin, A. Y., Burke, J. V., Pillonetto, G. (2013). Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. The Journal of Machine Learning Research, 14(1), 2689–2728.

  • Avella-Medina, M. (2017). Influence functions for penalized M-estimators. Bernoulli, 23(4B), 3178–3196.

  • Averbukh, V., Smolyanov, O. (1967). The theory of differentiation in linear topological spaces. Russian Mathematical Surveys, 22(6), 201–258.

  • Banerjee, O., Ghaoui, L. E., d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.

  • Berge, C. (1963). Topological spaces: Including a treatment of multi-valued functions, vector spaces, and convexity. Courier Corporation.

  • Beutner, E., Zähle, H. (2010). A modified functional delta method and its application to the estimation of risk functionals. Journal of Multivariate Analysis, 101(10), 2452–2463.

  • Beutner, E., Zähle, H. (2016). Functional delta-method for the bootstrap of quasi-Hadamard differentiable functionals. Electronic Journal of Statistics, 10(1), 1181–1222.

  • Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34(2), 559–583.

  • Bühlmann, P., Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

  • Bühlmann, P., Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media.

  • Chang, L., Roberts, S., Welsh, A. (2018). Robust lasso regression using Tukey’s biweight criterion. Technometrics, 60(1), 36–47.

  • Chen, X., Wang, Z. J., McKeown, M. J. (2010a). Asymptotic analysis of robust lassos in the presence of noise with large variance. IEEE Transactions on Information Theory, 56(10), 5131–5149.

  • Chen, X., Wang, Z. J., McKeown, M. J. (2010b). Asymptotic analysis of the Huberized lasso estimator. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1898–1901. IEEE.

  • Christmann, A., Steinwart, I. (2004). On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007–1034.

  • Christmann, A., Van Messem, A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9(May), 915–936.

  • Christmann, A., Van Messem, A., Steinwart, I. (2009). On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2(3), 311–327.

  • Clarke, B. R. (1983). Uniqueness and Fréchet differentiability of functional solutions to maximum likelihood type equations. The Annals of Statistics, 11(4), 1196–1205.

  • Clémençon, S., Lugosi, G., Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2), 844–874.

  • Clémençon, S., Depecker, M., Vayatis, N. (2013). An empirical comparison of learning algorithms for nonparametric scoring: The TreeRank algorithm and other methods. Pattern Analysis and Applications, 16(4), 475–496.

  • De los Reyes, J. C., Schönlieb, C.-B., Valkonen, T. (2016). The structure of optimal parameters for image restoration problems. Journal of Mathematical Analysis and Applications, 434(1), 464–500.

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.

  • Evgrafov, A., Patriksson, M. (2003). Stochastic structural topology optimization: discretization and penalty function approach. Structural and Multidisciplinary Optimization, 25(3), 174–188.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

  • Fernholz, L. (1983). Von Mises calculus for statistical functionals. Lecture Notes in Statistics, Vol. 19. Springer.

  • Fraiman, R., Yohai, V. J., Zamar, R. H. (2001). Optimal robust M-estimates of location. The Annals of Statistics, 29(1), 194–223.

  • Fréchet, M. (1937). Sur la notion de différentielle dans l’analyse générale.

  • Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.

  • Friedman, J., Hastie, T., Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.

  • Gill, R. D., Wellner, J. A., Præstgaard, J. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part 1) [with discussion and reply]. Scandinavian Journal of Statistics, 16(2), 97–128.

  • Hable, R. (2012). Asymptotic normality of support vector machine variants and other regularized kernel methods. Journal of Multivariate Analysis, 106, 92–117.

  • Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.

  • Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W. (2011). Robust statistics: The approach based on influence functions (Vol. 114). John Wiley & Sons.

  • Huber, P. J., Ronchetti, E. (2009). Robust Statistics. Wiley.

  • Jain, N., Marcus, M. (1975). Central limit theorems for C(S)-valued random variables. Journal of Functional Analysis, 19(3), 216–231.

  • Kohl, M. (2005). Numerical contributions to the asymptotic theory of robustness. PhD thesis, University of Bayreuth.

  • Krätschmer, V., Schied, A., Zähle, H. (2012). Qualitative and infinitesimal robustness of tail-dependent statistical functionals. Journal of Multivariate Analysis, 103(1), 35–47.

  • Lambert-Lacroix, S., Zwald, L. (2011). Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5, 1015–1053.

  • LeCam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics, 41(3), 802–828.

  • LeDell, E., Petersen, M., van der Laan, M. (2015). Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electronic Journal of Statistics, 9(1), 1583.

  • Lee, S. (2015). An additive sparse penalty for variable selection in high-dimensional linear regression model. Communications for Statistical Applications and Methods, 22(2), 147–157.

  • Levitin, E., Tichatschke, R. (1998). On smoothing of parametric minimax-functions and generalized max-functions via regularization. Journal of Convex Analysis, 5, 199–220.

  • Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2), 866–896.

  • Loh, P.-L., Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16, 559–616.

  • Maronna, R., Martin, R., Yohai, V. (2006). Robust statistics: Theory and methods. Wiley.

  • Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.

  • Öllerer, V., Croux, C., Alfons, A. (2015). The influence function of penalized regression estimators. Statistics, 49(4), 741–765.

  • Osborne, M. R., Presnell, B., Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.

  • Pötscher, B. M., Leeb, H. (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2065–2082.

  • Pötscher, B. M., Schneider, U. (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference, 139(8), 2775–2790.

  • Pupashenko, D., Ruckdeschel, P., Kohl, M. (2015). L2 differentiability of generalized linear models. Statistics & Probability Letters, 97(C), 155–164.

  • Reeds, J. (1976). On the definition of von Mises functionals. PhD thesis, Harvard University.

  • Rieder, H. (1994). Robust asymptotic statistics (Vol. 1). Springer Science & Business Media.

  • Rieder, H., Kohl, M., Ruckdeschel, P. (2008). The cost of not knowing the radius. Statistical Methods & Applications, 17(1), 13–40.

  • Rieder, H., Ruckdeschel, P. (2001). Short proofs on \({L}_r\)-differentiability. Statistics & Risk Modeling, 19(4), 419–426.

  • Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

  • Van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86.

  • Van de Geer, S. (2016). Estimation and testing under sparsity. Springer.

  • Van der Vaart, A. (2000). Asymptotic statistics, (Vol. 3). Cambridge University Press.

  • Van der Vaart, A., Wellner, J. (2013). Weak convergence and empirical processes: With applications to statistics. Springer Science & Business Media.

  • Vapnik, V. (1998). Statistical learning theory (Vol. 1). New York: Wiley.

  • Vito, E. D., Rosasco, L., Caponnetto, A., Piana, M., Verri, A. (2004). Some properties of regularized kernel methods. Journal of Machine Learning Research, 5, 1363–1390.

  • Von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3), 309–348.

  • Wellner, J. A. (1992). Empirical processes in action: A review. International Statistical Review/Revue Internationale de Statistique, 247–269.

  • Werner, D. (2006). Funktionalanalysis. Springer.

  • Werner, T. (2019). Gradient-Free Gradient Boosting. PhD thesis, Carl von Ossietzky Universität Oldenburg.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.


Acknowledgements

The results presented in this paper are part of the author’s PhD thesis (Werner 2019), supervised by P. Ruckdeschel at Carl von Ossietzky University Oldenburg. I thank an anonymous referee for valuable comments that helped to improve the quality of the paper. I also thank Prof. T. Dickhaus for calling my attention to the work of Pötscher and Leeb.

Author information

Corresponding author

Correspondence to Tino Werner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Miscellaneous

The concept of \(L_2\)-differentiability originally goes back to LeCam (1970).

Definition 5

Let \(\mathcal {P}:=\{P_{\theta } \ | \ \theta \in \varTheta \}\) be a family of probability measures on some measurable space \((\varOmega , \mathcal {A})\) and let \(\varTheta\) be a subset of \(\mathbb {R}^p\). Then, \(\mathcal {P}\) is \(L_2\)-differentiable at \(\theta _0\) if there exists \(\varLambda _{\theta _0} \in L_2^p(P_{\theta _0})\) such that

$$\begin{aligned} \displaystyle \left| \left| \sqrt{dP_{\theta _0+h}}-\sqrt{dP_{\theta _0}}\left( 1+ \frac{1}{2} h^T \varLambda _{\theta _0} \right) \right| \right| _{L_2}=o(||h||) \end{aligned}$$

for \(||h|| \rightarrow 0\). In this case, the function \(\varLambda _{\theta _0}\) is the \(L_2\)-derivative and \(I_{\theta _0}:=\mathbb {E}_{\theta _0}[\varLambda _{\theta _0}\varLambda _{\theta _0}^T]\) is the Fisher information of \(\mathcal {P}\) at \(\theta _0\).

Note that \(L_2\)-differentiability is a special case of the wider concept of \(L_r\)-differentiability (cf. Rieder and Ruckdeschel 2001). \(L_2\)-differentiability holds for many distribution families, including normal location and scale families, Poisson families, gamma families, and even ARMA, ARCH and GPD families (Rieder et al. 2008; Pupashenko et al. 2015). A standard example of a family that is not \(L_2\)-differentiable is \(\mathcal {P}:=\{U([0,\theta ]) \ | \ \theta \in \varTheta \}\), since the support of \(U([0,\theta ])\) depends on \(\theta\).
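For illustration (a standard textbook example, not part of the original appendix), consider the normal location family \(\mathcal {P}:=\{N(\theta , \sigma ^2) \ | \ \theta \in \mathbb {R}\}\) with known variance \(\sigma ^2\). Differentiating the log-density in \(\theta\) yields the \(L_2\)-derivative and the classical Fisher information,

$$\begin{aligned} \displaystyle \varLambda _{\theta _0}(x)=\frac{x-\theta _0}{\sigma ^2}, \qquad I_{\theta _0}=\mathbb {E}_{\theta _0}[\varLambda _{\theta _0}^2]=\frac{1}{\sigma ^2} . \end{aligned}$$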

The following definition of partial influence curves, and the corresponding asymptotically linear expansion in terms of such partial influence curves, is borrowed from Rieder (1994, Def. 4.2.10) and Rieder et al. (2008).

Definition 6

Let \((\varOmega ^n, \mathcal {A}^n)\) be a measurable space and let \(S_n: (\varOmega ^n, \mathcal {A}^n) \rightarrow (\mathbb {R}^q, \mathbb {B}^q)\) be an estimator for the transformed quantity of interest \(\tau (\theta )\). Assume that \(\tau : \varTheta \rightarrow \mathbb {R}^q\) is differentiable at \(\theta _0 \in \varTheta\) where \(\varTheta \subset \mathbb {R}^p\) and \(q \le p\). Denote the Jacobian by \(\partial _{\theta _0} \tau =:D_{\theta _0} \in \mathbb {R}^{q \times p}\). Then, the set of partial influence curves is defined by

$$\begin{aligned} \displaystyle \varPsi _2^D(\theta _0):=\{\eta _{\theta _0} \in L_2^q(P_{\theta _0}) \ | \ \mathbb {E}_{\theta _0}[\eta _{\theta _0}]=0, \ \mathbb {E}_{\theta _0}[\eta _{\theta _0} \varLambda _{\theta _0}^T]=D_{\theta _0}\} . \end{aligned}$$

The sequence \((S_n)_n\) is asymptotically linear at \(P_{\theta _0}\) if there exists a partial influence curve \(\eta _{\theta _0} \in \varPsi _2^D(\theta _0)\) such that the expansion

$$\begin{aligned} \displaystyle S_n=\tau (\theta _0)+\frac{1}{n} \sum _{i=1}^n \eta _{\theta _0}(x_i)+o_{P_{\theta _0}^n}(n^{-1/2}) \end{aligned}$$

is valid.
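As a classical sanity check (our addition; the identifications below are standard): for the maximum likelihood estimator with \(\tau =\mathrm {id}_{\varTheta }\), so that \(D_{\theta _0}=\mathbb {I}_p\), the Fisher-scaled score

$$\begin{aligned} \displaystyle \eta _{\theta _0}:=I_{\theta _0}^{-1}\varLambda _{\theta _0} \end{aligned}$$

is a partial influence curve: \(\mathbb {E}_{\theta _0}[\eta _{\theta _0}]=0\) and \(\mathbb {E}_{\theta _0}[\eta _{\theta _0}\varLambda _{\theta _0}^T]=I_{\theta _0}^{-1}I_{\theta _0}=\mathbb {I}_p=D_{\theta _0}\). In the normal location example above, \(\eta _{\theta _0}(x)=x-\theta _0\), and the expansion reduces to \(S_n=\theta _0+\frac{1}{n}\sum _{i=1}^n (x_i-\theta _0)=\bar{x}_n\), the sample mean, with remainder identically zero.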

For the following lemma, we refer to Evgrafov and Patriksson (2003) and Levitin and Tichatschke (1998).

Lemma 4

Let \(f: \mathcal {X} \times \mathcal {Y} \times \varTheta \rightarrow \mathbb {R}\) be continuous, where \(\mathcal {X} \subset \mathbb {R}^n\), \(\mathcal {Y} \subset \mathbb {R}^m\), \(\varTheta \subset \mathbb {R}^k\). Define \(\varXi (x,y):=\mathrm{argmin}_{\theta }(f(x,y,\theta ))\). If f is coercive w.r.t. \(\theta\), i.e., the sets \(\{\theta \in \varTheta \ | \ f(x,y,\theta ) \le c\}\) are bounded for all \(c \in \mathbb {R}\) for every \(x \in \mathcal {X}\), \(y \in \mathcal {Y}\), then \(\min _{\theta }(f(x,y,\theta ))>-\infty\) and \(\varXi (x,y)\) is nonempty and compact for any x, y.
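To see Lemma 4 in action, here is a quick example of our own (with \(\lambda >0\) an assumption): the Lasso-type objective

$$\begin{aligned} \displaystyle f(x,y,\theta )=(y-x^T\theta )^2+\lambda ||\theta ||_1 \end{aligned}$$

is continuous, and since \(f(x,y,\theta ) \ge \lambda ||\theta ||_1\), every level set \(\{\theta \ | \ f(x,y,\theta ) \le c\}\) is contained in the bounded set \(\{\theta \ | \ ||\theta ||_1 \le c/\lambda \}\). Hence f is coercive w.r.t. \(\theta\), the minimum is finite, and \(\varXi (x,y)\) is nonempty and compact.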

About this article

Cite this article

Werner, T. Asymptotic linear expansion of regularized M-estimators. Ann Inst Stat Math 74, 167–194 (2022). https://doi.org/10.1007/s10463-021-00792-5
