Abstract
Parametric high-dimensional regression requires regularization terms in order to obtain interpretable models. The resulting estimators correspond to regularized M-functionals, which are inherently highly nonlinear. Their Gâteaux derivative, i.e., their influence curve, linearizes the asymptotic bias of the estimator, but only up to a remainder term that, without deeper arguments, is not guaranteed to tend (sufficiently fast) to zero uniformly on suitable tangent sets. We fill this gap by studying, in a unified framework, under which conditions the M-functionals corresponding to convex penalties as regularization are compactly differentiable, so that the estimators admit an asymptotically linear expansion. This key ingredient allows influence curves to reasonably enter model diagnosis and enables a fast, valid update formula that requires only an evaluation of the corresponding influence curve at new data points. Moreover, it paves the way for optimally robust estimators, obtained by bounding the influence curves in a suitable way.
References
Alfons, A., Croux, C., Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248.
Alqallaf, F., Van Aelst, S., Yohai, V. J., Zamar, R. H. (2009). Propagation of outliers in multivariate data. The Annals of Statistics, 37(1), 311–331.
Aravkin, A. Y., Burke, J. V., Pillonetto, G. (2013). Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. The Journal of Machine Learning Research, 14(1), 2689–2728.
Avella-Medina, M. (2017). Influence functions for penalized M-estimators. Bernoulli, 23(4B), 3178–3196.
Averbukh, V., Smolyanov, O. (1967). The theory of differentiation in linear topological spaces. Russian Mathematical Surveys, 22(6), 201–258.
Banerjee, O., Ghaoui, L. E., d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Berge, C. (1963). Topological Spaces: Including a treatment of multi-valued functions, vector spaces, and convexity. Courier Corporation.
Beutner, E., Zähle, H. (2010). A modified functional delta method and its application to the estimation of risk functionals. Journal of Multivariate Analysis, 101(10), 2452–2463.
Beutner, E., Zähle, H. (2016). Functional delta-method for the bootstrap of quasi-Hadamard differentiable functionals. Electronic Journal of Statistics, 10(1), 1181–1222.
Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34(2), 559–583.
Bühlmann, P., Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Bühlmann, P., Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media.
Chang, L., Roberts, S., Welsh, A. (2018). Robust lasso regression using Tukey’s biweight criterion. Technometrics, 60(1), 36–47.
Chen, X., Wang, Z. J., McKeown, M. J. (2010a). Asymptotic analysis of robust lassos in the presence of noise with large variance. IEEE Transactions on Information Theory, 56(10), 5131–5149.
Chen, X., Wang, Z. J., McKeown, M. J. (2010b). Asymptotic analysis of the Huberized lasso estimator. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1898–1901. IEEE.
Christmann, A., Steinwart, I. (2004). On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007–1034.
Christmann, A., Van Messem, A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9(May), 915–936.
Christmann, A., Van Messem, A., Steinwart, I. (2009). On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2(3), 311–327.
Clarke, B. R. (1983). Uniqueness and Fréchet differentiability of functional solutions to maximum likelihood type equations. The Annals of Statistics, 11(4), 1196–1205.
Clémençon, S., Lugosi, G., Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2), 844–874.
Clémençon, S., Depecker, M., Vayatis, N. (2013). An empirical comparison of learning algorithms for nonparametric scoring: The TreeRank algorithm and other methods. Pattern Analysis and Applications, 16(4), 475–496.
De los Reyes, J. C., Schönlieb, C.-B., Valkonen, T. (2016). The structure of optimal parameters for image restoration problems. Journal of Mathematical Analysis and Applications, 434(1), 464–500.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Evgrafov, A., Patriksson, M. (2003). Stochastic structural topology optimization: discretization and penalty function approach. Structural and Multidisciplinary Optimization, 25(3), 174–188.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fernholz, L. (1983). Von Mises calculus for statistical functionals. Lecture Notes in Statistics, Vol. 19. Springer.
Fraiman, R., Yohai, V. J., Zamar, R. H. (2001). Optimal robust M-estimates of location. The Annals of Statistics, 29(1), 194–223.
Fréchet, M. (1937). Sur la notion de différentielle dans l’analyse générale.
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
Friedman, J., Hastie, T., Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
Gill, R. D., Wellner, J. A., Præstgaard, J. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part 1) [with discussion and reply]. Scandinavian Journal of Statistics, 16(2), 97–128.
Hable, R. (2012). Asymptotic normality of support vector machine variants and other regularized kernel methods. Journal of Multivariate Analysis, 106, 92–117.
Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.
Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W. (2011). Robust statistics: The approach based on influence functions, (Vol. 114). John Wiley & Sons.
Huber, P. J., Ronchetti, E. (2009). Robust Statistics. Wiley.
Jain, N., Marcus, M. (1975). Central limit theorems for C(S)-valued random variables. Journal of Functional Analysis, 19(3), 216–231.
Kohl, M. (2005). Numerical contributions to the asymptotic theory of robustness. PhD thesis, University of Bayreuth.
Krätschmer, V., Schied, A., Zähle, H. (2012). Qualitative and infinitesimal robustness of tail-dependent statistical functionals. Journal of Multivariate Analysis, 103(1), 35–47.
Lambert-Lacroix, S., Zwald, L. (2011). Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5, 1015–1053.
LeCam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics, 41(3), 802–828.
LeDell, E., Petersen, M., van der Laan, M. (2015). Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electronic Journal of Statistics, 9(1), 1583.
Lee, S. (2015). An additive sparse penalty for variable selection in high-dimensional linear regression model. Communications for Statistical Applications and Methods, 22(2), 147–157.
Levitin, E., Tichatschke, R. (1998). On smoothing of parametric minimax-functions and generalized max-functions via regularization. Journal of Convex Analysis, 5, 199–220.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2), 866–896.
Loh, P.-L., Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16, 559–616.
Maronna, R., Martin, R., Yohai, V. (2006). Robust statistics: Theory and methods. Wiley.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Öllerer, V., Croux, C., Alfons, A. (2015). The influence function of penalized regression estimators. Statistics, 49(4), 741–765.
Osborne, M. R., Presnell, B., Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.
Pötscher, B. M., Leeb, H. (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2065–2082.
Pötscher, B. M., Schneider, U. (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference, 139(8), 2775–2790.
Pupashenko, D., Ruckdeschel, P., Kohl, M. (2015). L2 differentiability of generalized linear models. Statistics & Probability Letters, 97(C), 155–164.
Reeds, J. (1976). On the definition of von Mises functionals. PhD thesis, Harvard University.
Rieder, H. (1994). Robust asymptotic statistics (Vol. 1). Springer Science & Business Media.
Rieder, H., Kohl, M., Ruckdeschel, P. (2008). The cost of not knowing the radius. Statistical Methods & Applications, 17(1), 13–40.
Rieder, H., Ruckdeschel, P. (2001). Short proofs on \({L}_r\)-differentiability. Statistics & Risk Modeling, 19(4), 419–426.
Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86.
Van de Geer, S. (2016). Estimation and testing under sparsity. Springer.
Van der Vaart, A. (2000). Asymptotic statistics, (Vol. 3). Cambridge University Press.
Van der Vaart, A., Wellner, J. (2013). Weak convergence and empirical processes: With applications to statistics. Springer Science & Business Media.
Vapnik, V. (1998). Statistical learning theory (Vol. 1). New York: Wiley.
De Vito, E., Rosasco, L., Caponnetto, A., Piana, M., Verri, A. (2004). Some properties of regularized kernel methods. Journal of Machine Learning Research, 5, 1363–1390.
Von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3), 309–348.
Wellner, J. A. (1992). Empirical processes in action: A review. International Statistical Review/Revue Internationale de Statistique, 247–269.
Werner, D. (2006). Funktionalanalysis. Springer.
Werner, T. (2019). Gradient-Free Gradient Boosting. PhD thesis, Carl von Ossietzky Universität Oldenburg.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
The results presented in this paper are part of the author’s PhD thesis (Werner 2019), supervised by P. Ruckdeschel at Carl von Ossietzky University Oldenburg. I thank an anonymous referee for valuable comments that helped improve the quality of the paper. I also thank Prof. T. Dickhaus for calling my attention to the work of Pötscher and Leeb.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Miscellaneous
The concept of \(L_2\)-differentiability originally goes back to LeCam (1970).
Definition 5
Let \(\mathcal {P}:=\{P_{\theta } \ | \ \theta \in \varTheta \}\) be a family of probability measures on some measurable space \((\varOmega , \mathcal {A})\) and let \(\varTheta\) be a subset of \(\mathbb {R}^p\). Then, \(\mathcal {P}\) is \(L_2\)-differentiable at \(\theta _0\) if there exists \(\varLambda _{\theta _0} \in L_2^p(P_{\theta _0})\) such that
\[ \left\Vert \sqrt{dP_{\theta _0+h}}-\sqrt{dP_{\theta _0}}\Bigl (1+\frac{1}{2}h^T\varLambda _{\theta _0}\Bigr )\right\Vert _{L_2}=o(||h||) \]
for \(||h|| \rightarrow 0\). In this case, the function \(\varLambda _{\theta _0}\) is the \(L_2\)-derivative and \(I_{\theta _0}:=\mathbb {E}_{\theta _0}[\varLambda _{\theta _0}\varLambda _{\theta _0}^T]\) is the Fisher information of \(\mathcal {P}\) at \(\theta _0\).
Note that \(L_2\)-differentiability is a special case of the wider concept of \(L_r\)-differentiability (cf. Rieder and Ruckdeschel 2001). The \(L_2\)-differentiability holds for many distribution families, including normal location and scale families, Poisson families, gamma families, and even for ARMA, ARCH and GPD families (Rieder et al. 2008; Pupashenko et al. 2015). A standard example of a distribution family that is not \(L_2\)-differentiable is the model \(\mathcal {P}:=\{U([0,\theta ]) \ | \ \theta \in \varTheta \}\).
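As a standard textbook illustration (not one of the models treated in this paper): for the normal location family \(\{N(\theta ,\sigma ^2) \ | \ \theta \in \mathbb {R}\}\) with known \(\sigma ^2>0\), the \(L_2\)-derivative coincides with the classical score function,

```latex
\varLambda_{\theta_0}(x)
  = \partial_{\theta}\log\varphi_{\theta,\sigma^2}(x)\Big|_{\theta=\theta_0}
  = \frac{x-\theta_0}{\sigma^2},
\qquad
I_{\theta_0}
  = \mathbb{E}_{\theta_0}\bigl[\varLambda_{\theta_0}^2\bigr]
  = \frac{1}{\sigma^2},
```

so the Fisher information reduces to the familiar \(1/\sigma ^2\).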
The following definition of partial influence curves and the corresponding asymptotically linear expansion in terms of such partial influence curves is borrowed from Rieder (1994, Def. 4.2.10) and Rieder et al. (2008).
Definition 6
Let \((\varOmega ^n, \mathcal {A}^n)\) be a measurable space and let \(S_n: (\varOmega ^n, \mathcal {A}^n) \rightarrow (\mathbb {R}^q, \mathbb {B}^q)\) be an estimator for the transformed quantity of interest \(\tau (\theta )\). Assume that \(\tau : \varTheta \rightarrow \mathbb {R}^q\) is differentiable at \(\theta _0 \in \varTheta\) where \(\varTheta \subset \mathbb {R}^p\) and \(q \le p\). Denote the Jacobian by \(\partial _{\theta _0} \tau =:D_{\theta _0} \in \mathbb {R}^{q \times p}\). Then, the set of partial influence curves is defined by
\[ \varPsi _2^D(\theta _0):=\bigl \{\eta _{\theta _0} \in L_2^q(P_{\theta _0}) \ | \ \mathbb {E}_{\theta _0}[\eta _{\theta _0}]=0, \ \mathbb {E}_{\theta _0}[\eta _{\theta _0}\varLambda _{\theta _0}^T]=D_{\theta _0}\bigr \}. \]
The sequence \((S_n)_n\) is asymptotically linear at \(P_{\theta _0}\) if there exists a partial influence curve \(\eta _{\theta _0} \in \varPsi _2^D(\theta _0)\) such that the expansion
\[ S_n=\tau (\theta _0)+\frac{1}{n}\sum _{i=1}^n \eta _{\theta _0}(X_i)+o_{P_{\theta _0}^n}(n^{-1/2}) \]
is valid.
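A minimal numerical sketch of the expansion, using my own example rather than anything from the paper: in the normal location model with \(\tau =\mathrm{id}\), the sample mean is asymptotically linear with influence curve \(\eta _{\theta _0}(x)=x-\theta _0\); for this particular estimator the expansion even holds exactly, with vanishing remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 1.5
x = rng.normal(loc=theta0, scale=1.0, size=1000)

S_n = x.mean()                   # the estimator: sample mean
eta = x - theta0                 # influence curve values eta(x_i) = x_i - theta0
expansion = theta0 + eta.mean()  # tau(theta0) + (1/n) sum_i eta(x_i)

# For the sample mean the remainder term is zero up to floating point:
remainder = S_n - expansion
print(remainder)
```

For nonlinear estimators such as regularized M-estimators, the remainder is of course nonzero for finite n; the point of the paper is precisely that it is \(o_{P_{\theta _0}^n}(n^{-1/2})\) under compact differentiability.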
For the following lemma, we refer to Evgrafov and Patriksson (2003) and Levitin and Tichatschke (1998).
Lemma 4
Let \(f: \mathcal {X} \times \mathcal {Y} \times \varTheta \rightarrow \mathbb {R}\) be continuous, where \(\mathcal {X} \subset \mathbb {R}^n\), \(\mathcal {Y} \subset \mathbb {R}^m\), \(\varTheta \subset \mathbb {R}^k\). Define \(\varXi (x,y):=\mathrm{argmin}_{\theta }(f(x,y,\theta ))\). If f is coercive w.r.t. \(\theta\), i.e., if the level sets \(\{\theta \in \varTheta \ | \ f(x,y,\theta ) \le c\}\) are bounded for all \(c \in \mathbb {R}\) and every \(x \in \mathcal {X}\), \(y \in \mathcal {Y}\), then \(\min _{\theta }(f(x,y,\theta ))>-\infty\) and \(\varXi (x,y)\) is nonempty and compact for any x and y.
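A toy illustration of the lemma in the spirit of the paper, with names and numbers of my own choosing: the lasso-type objective \(f(x,y,\theta )=(y-\theta x)^2+\lambda |\theta |\) is continuous and coercive in \(\theta\) whenever \(x \ne 0\), so its argmin is nonempty and compact; here it is even a singleton, given by soft-thresholding.

```python
import numpy as np

def f(x, y, theta, lam):
    """Lasso-type objective: coercive in theta for x != 0."""
    return (y - theta * x) ** 2 + lam * abs(theta)

def argmin_closed_form(x, y, lam):
    """Unique minimizer via soft-thresholding of x*y at lam/2."""
    z = x * y
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / x ** 2

x, y, lam = 2.0, 3.0, 1.0
theta_star = argmin_closed_form(x, y, lam)   # (|6| - 0.5) / 4 = 1.375

# Cross-check against a brute-force grid search over a compact interval:
theta_grid = np.linspace(-5, 5, 200001)
theta_grid_min = theta_grid[np.argmin(f(x, y, theta_grid, lam))]
print(theta_star, theta_grid_min)
```

The coercivity is what licenses restricting the grid search to a bounded interval in the first place: outside a sufficiently large compact set, f exceeds any fixed level.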
About this article
Cite this article
Werner, T. Asymptotic linear expansion of regularized M-estimators. Ann Inst Stat Math 74, 167–194 (2022). https://doi.org/10.1007/s10463-021-00792-5