High-dimensional Statistics: A Non-asymptotic Viewpoint, Martin J. Wainwright, Cambridge University Press, 2019, xvii + 552 pages, £57.99, hardback ISBN: 978-1-1084-9802-9
International Statistical Review, Pub Date: 2020-04-12, DOI: 10.1111/insr.12370
G. Alastair Young

Readership: Statistics/machine learning graduate students and researchers.

This is an excellent book. It provides a lucid, accessible and in‐depth treatment of non‐asymptotic high‐dimensional statistical theory, which is critical as the underpinning of modern statistics and machine learning. It succeeds brilliantly in providing a self‐contained overview of high‐dimensional statistics, suitable for use in formal courses or for self‐study by graduate‐level students or researchers. The treatment is outstandingly clear and engaging, and the production is first‐rate. It will quickly become essential reading and the key reference text in the field.

Conventional, classical statistics, as developed in the early 1900s, is founded on an asymptotic regime in which the dimension p of the parameter in the statistical model remains fixed as the sample size n grows to infinity. Standard laws of large numbers and the central limit theorem then furnish a general suite of inferential techniques, typically based on the asymptotic consistency, normality and efficiency of the maximum likelihood estimator. Such inferential techniques, which served as the bedrock of statistical analysis for decades, have been extended, from the 1980s onwards, by the development of refined, likelihood-based and bootstrap methods of distributional approximation: some of the key elements of this substantial theory of 'higher-order asymptotics' are described in the brief review article Young (2009). Typical focus in classical theory concerns a parametric model F(y; θ) indexed by a p-dimensional parameter θ = (ψ, λ), where ψ is a scalar interest parameter and λ is a (p-1)-dimensional nuisance parameter. Inference on ψ is based on a random sample Y of size n from F(y; θ), and, for instance, it is required to construct a confidence interval I_α(Y) for ψ, of nominal coverage 1 - α. Bootstrap or analytic approximation is made for the sampling distribution of a 'pivot' T(Y, θ), such as the signed square root of the likelihood ratio statistic or some modification thereof. This estimated sampling distribution is then used to construct an accurate confidence set I_α(Y), with the property, valid assuming only correctness of the model F(y; θ), that

$$ \mathrm{Pr}_\theta\{\psi \in I_\alpha(Y)\} = 1 - \alpha + O(n^{-r}), $$
for quantifiable r, typically r = 1 or r = 3/2. Though the error term O(n^{-r}) will typically depend on unknown quantities, so that such a result is an asymptotic statement, the operational interpretation is immediate: as n → ∞, the confidence set yields exactly the nominal coverage 1 - α under repeated sampling. In cases where λ is of low dimension, it is a rule of thumb that if r = 3/2, the confidence set will give essentially exact coverage for modest sample sizes, say n = 10 or 20, though the theory offers no guarantee of the magnitude of the error for any finite n.
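As a purely illustrative aside, not drawn from the review or the book, a small Monte Carlo sketch makes the operational interpretation concrete: under an assumed exponential model, the empirical coverage of a standard normal-approximation (Wald) interval for the mean approaches the nominal 95% level as n grows, in the spirit of the 1 - α + O(n^{-r}) statement above.

```python
# Hypothetical illustration (not from the review): empirical coverage of a
# Wald interval for the mean of an exponential distribution.  Coverage falls
# short of the nominal level for small n and approaches it as n increases.
import numpy as np

rng = np.random.default_rng(0)
mu, z, n_rep = 1.0, 1.96, 10_000          # true mean, 97.5% normal quantile, replications

for n in (10, 20, 100, 1000):
    y = rng.exponential(scale=mu, size=(n_rep, n))
    m = y.mean(axis=1)
    se = y.std(axis=1, ddof=1) / np.sqrt(n)
    coverage = np.mean((m - z * se <= mu) & (mu <= m + z * se))
    print(f"n = {n:5d}: empirical coverage = {coverage:.3f} (nominal 0.95)")
```

For small n the empirical coverage falls noticeably below 0.95; the shortfall shrinks as n grows, which is exactly the kind of statement the asymptotic theory makes precise.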

But the data sets that arise in many areas of modern science and engineering generally have parameter dimension p of the same order as, and often exceeding, the sample size n, and for such problems classical statistical theory may fail to provide useful estimation or prediction, or may simply break down completely. While investigations have established that accurate inference may often be obtained from higher-order asymptotic methodology in circumstances where the parameter dimension p is relatively large compared with the available sample size n (see evidence contained in Barndorff-Nielsen & Cox, 1994, for instance), there has been relatively little systematic, direct theoretical examination of what Wainwright terms 'high-dimensional asymptotics', where the pair (n, p) are taken to infinity simultaneously, in such a way that some scaling function of (n, p), and possibly other problem parameters, remains fixed or converges to some finite limit. Instead, the focus in modern statistical theory has been on non-asymptotic results in high-dimensional problems. In this theory, the pair (n, p), as well as other problem parameters, are viewed as fixed, and high-probability statements, say about the error of a parameter estimator, are made as a function of them. As its title suggests, non-asymptotic theoretical results of this type are the focus of this book. Chapter 1 gives a beautiful overview, illustrating through key examples involving linear discriminant analysis, covariance estimation and nonparametric regression what can go wrong with classical statistics in high dimensions, and motivating persuasively the non-asymptotic viewpoint. The kind of non-asymptotic theory developed in the book, founded on obtaining bounds on the tails of a random quantity, or concentration inequalities which bound how far a random variable deviates from some value, such as its mean, has its main value in its ability to predict aspects of high-dimensional asymptotic phenomena, such as the limiting forms, as (n, p) grow, of error probabilities in the linear discriminant problem. Further, the scaling functions that emerge in a non-asymptotic analysis can suggest the appropriate high-dimensional asymptotic analysis to perform in order to reveal relevant limiting distributional behaviour. The non-asymptotic analysis is typically not used, or immediately available, to provide precise statements about the behaviour, say, of an estimator for a given, finite (n, p).
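As a rough numerical illustration of the covariance-estimation failure mode mentioned above (a sketch not taken from the book, assuming i.i.d. standard Gaussian rows with identity covariance), the extreme eigenvalues of the sample covariance matrix drift far from their true common value of 1 once p is comparable with n.

```python
# Hypothetical sketch: sample covariance of N(0, I_p) data.  With p fixed the
# eigenvalues concentrate near 1; with p of the same order as n they spread
# over a wide interval, one of the Chapter 1 failure modes.
import numpy as np

rng = np.random.default_rng(1)
n = 200
for p in (10, 100, 200):                     # p << n, p ~ n/2, p ~ n
    X = rng.standard_normal((n, p))          # rows i.i.d. N(0, I_p)
    eig = np.linalg.eigvalsh(X.T @ X / n)    # eigenvalues of the sample covariance
    print(f"p = {p:3d}: eigenvalues in [{eig.min():.2f}, {eig.max():.2f}]")
```

The spread seen for p ≈ n does not vanish as n grows with p/n held fixed; quantifying such effects for finite (n, p) is precisely what the non-asymptotic bounds in the book are designed to do.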

As an example of a non-asymptotic result, we take an illustration from Wainwright (Example 7.14), concerning the classical linear model

$$ Y = X\beta + \epsilon, $$
where the design matrix $X \in \mathbb{R}^{n \times p}$ is deterministic and the noise vector ϵ has independent elements, identically distributed as N(0, σ²). Assume that X satisfies some (checkable, since X is fixed) 'restricted eigenvalue' condition and that it has $\max_{j=1,\dots,p} \|X_j\|_2 / \sqrt{n} \le C$, where X_j is the jth column of X. Suppose further that the vector β is supported on a subset S ⊆ {1, 2, …, p} with |S| = s, so that β is sparse, with s non-zero elements. If we define the Lasso estimator $\hat{\beta}$ of β as the minimiser of
$$ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1, $$
we have that, for constant K,
$$ \|\hat{\beta} - \beta\|_2 \le K \sqrt{s} \left( \sqrt{\frac{2 \log p}{n}} + \delta \right), $$
with probability at least $1 - 2e^{-n\delta^2/2}$, for any δ > 0, if we set $\lambda = 2C\sigma\left(\sqrt{\tfrac{2\log p}{n}} + \delta\right)$. Then we see that, provided K, which is defined in terms of C, σ² and the eigenvalue condition, stays fixed as (n, p) increase, the Lasso estimator $\hat{\beta}$ is consistent, as long as log p is dominated by n, if the size of the true β, as determined by s, remains fixed. Such a result predicts rather little, though, about the finite-sample behaviour of the estimator $\hat{\beta}$: the bound on $\|\hat{\beta} - \beta\|_2$ depends, inter alia, on the unknown error variance σ² and the unknown true β, through its sparsity level s. As is the case for classical statistical theory developed for the fixed-p regime, non-asymptotic results of this kind therefore only offer precise guarantees asymptotically. But the beauty of the theory of high-dimensional statistics as described by Wainwright is precisely that non-asymptotic results can yield strong operational support to practical methods of data analysis, such as the Lasso described above. For instance, while the Lasso estimator is not universally optimal, it comes close to mimicking the properties of an oracle estimator (which knows the true state of nature) in many sparse settings when estimating the mean E(Y) of Y for given X: the Lasso predicts E(Y) almost as well as an oracle which knows which of the elements of β are non-zero.
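To connect the bound above to something executable, the following is a hedged sketch (the Gaussian design, the sparsity level s = 5 and the particular choice of λ are assumptions of this illustration, not the exact settings of Wainwright's Example 7.14) comparing the ℓ₂ error of a Lasso fit with the √(s log p / n) scaling that the result predicts.

```python
# Hypothetical sketch: Lasso error versus the sqrt(s log p / n) rate.
# scikit-learn's Lasso minimises (1/(2n))||y - Xb||_2^2 + alpha*||b||_1,
# the same criterion displayed in the text.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, s, sigma = 200, 1000, 5, 1.0
X = rng.standard_normal((n, p))                  # columns have norm ~ sqrt(n), so C ~ 1
beta = np.zeros(p)
beta[:s] = 1.0                                   # sparse truth with s non-zero entries
y = X @ beta + sigma * rng.standard_normal(n)

lam = 2.0 * sigma * np.sqrt(2 * np.log(p) / n)   # lambda of the order used in the text
fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

err = np.linalg.norm(fit.coef_ - beta)
rate = np.sqrt(s * np.log(p) / n)                # the predicted scaling, constant unknown
print(f"||beta_hat - beta||_2 = {err:.3f},  sqrt(s log p / n) = {rate:.3f}")
```

The constant relating the two numbers is, as noted above, unknown in practice; the point of the sketch is only that the error tracks the predicted scaling, not that the bound can be evaluated from data.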

Chapter 1 of Wainwright's book also gives a very clear account of what makes statistics possible in the high-dimensional setting. What saves us is the reasonable expectation that high-dimensional data are actually endowed with some form of low-dimensional structure, typically some form of sparsity, which might crudely be expressed as meaning that only s of the p parameters of the model are non-zero or non-negligible, where s is much smaller than p. Much of high-dimensional statistics therefore involves constructing models of intrinsically high-dimensional phenomena, but where the models incorporate some implicit form of low-dimensional structure which can be successfully revealed from sample data. The introductory chapter is, in its own right, a tour de force, but it also sets the scene for a marvellous account of the mathematics and methodology of all the main elements of high-dimensional statistical theory. Beautifully signposted, the subsequent material of the book really divides into two types. Foundational material on tools and techniques, such as concentration inequalities, concentration of measure, uniform laws of large numbers, notions of covering and packing, reproducing kernel Hilbert spaces and techniques for obtaining minimax lower bounds, is elegantly described. Of mathematical interest in its own right, this material is directed here towards theory that is broadly applicable in high-dimensional statistics. Crucially for those interested in statistical practice, the book also provides a thorough account of the models and estimators used in data analysis. The text includes a series of chapters, each focused on a particular class of statistical estimation problems, including covariance estimation, the sparse linear model, principal component analysis, estimators based on decomposable regularisers, estimation of low-rank matrices, graphical models and least squares estimation in a nonparametric setting: the principal aim is to shed light on the theoretical guarantees offered by widely used estimation techniques. Careful attention is paid throughout to computational considerations, such as the gains made by replacing an initial, otherwise NP-hard, optimisation problem with a convex criterion that can be optimised efficiently, while ensuring that the resulting statistical procedure is almost as good as the one initially considered. Ideas and formalities are illustrated throughout the text with carefully chosen examples, primarily of a theoretical, rather than data-analytic, nature. A minor quibble of this reviewer is the lack of a glossary of notation: someone who is not immersed in the area might wonder precisely what symbols such as ≍ mean.

In summary, this is an authoritative, scholarly and highly useful summary of high-dimensional statistics. It is dense and not for the faint of heart, though the author offers excellent route maps through the material. As a text, it stands natural comparison with Giraud (2015) and Bühlmann and van de Geer (2011), but to suggest any preference here would be invidious.

References

Barndorff‐Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. CRC Press.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High‐Dimensional Data: Methods, Theory and Applications. Springer.

Giraud, C. (2015). Introduction to High-Dimensional Statistics. CRC Press.

Young, G.A. (2009). Routes to higher‐order accuracy in parametric inference. Aust. N.Z. J. Stat., 51, 115–126.


