Adaptive estimation in structured factor models with applications to overlapping clustering
Annals of Statistics (IF 4.5), Pub Date: 2020-08-01, DOI: 10.1214/19-aos1877
Xin Bing, Florentina Bunea, Yang Ning, Marten Wegkamp

This work introduces a novel estimation method, called LOVE, of the entries and structure of a loading matrix $A$ in a sparse latent factor model $X = AZ + E$, for an observable random vector $X$ in $\mathbb{R}^p$, with correlated unobservable factors $Z \in \mathbb{R}^K$, with $K$ unknown, and independent noise $E$. Each row of $A$ is scaled and sparse. In order to identify the loading matrix $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one latent factor. Despite the fact that the number of factors $K$, the number of the pure variables, and their location are all unknown, we only require a mild condition on the covariance matrix of $Z$, and a minimum of only two pure variables per latent factor to show that $A$ is uniquely defined, up to signed permutations. Our proofs for model identifiability are constructive, and lead to our novel estimation method of the number of factors and of the set of pure variables, from a sample of size $n$ of observations on $X$. This is the first step of our LOVE algorithm, which is optimization-free, and has low computational complexity of order $p^2$. The second step of LOVE is an easily implementable linear program that estimates $A$. We prove that the resulting estimator is minimax rate optimal up to logarithmic factors in $p$. The model structure is motivated by the problem of overlapping variable clustering, ubiquitous in data science. We define the population level clusters as groups of those components of $X$ that are associated, via the sparse matrix $A$, with the same unobservable latent factor, and multi-factor association is allowed. Clusters are respectively anchored by the pure variables, and form overlapping sub-groups of the $p$-dimensional random vector $X$. The Latent model approach to OVErlapping clustering is reflected in the name of our algorithm, LOVE.
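
To make the model in the abstract concrete, the following is a minimal simulation sketch of $X = AZ + E$ with pure variables. It is not the LOVE algorithm. The loading matrix, the factor covariance `C`, the noise levels, the Gaussian distributions, and the helper name `simulate_factor_model` are all hypothetical choices made for illustration. The abstract says only that the rows of $A$ are "scaled", so the row scaling used here (absolute row sums at most 1) is an assumption. The sketch checks that the sample covariance of $X$ is close to $A C A^\top + \Gamma$, where $\Gamma$ is the diagonal noise covariance, and that two pure variables of the same factor have absolute covariance equal to that factor's variance.

```python
import numpy as np

def simulate_factor_model(n, A, C, noise_sd, seed=0):
    """Draw n observations of X = A Z + E (hypothetical helper, Gaussian draws assumed)."""
    rng = np.random.default_rng(seed)
    p, K = A.shape
    Z = rng.multivariate_normal(np.zeros(K), C, size=n)  # correlated factors, n x K
    E = rng.normal(0.0, noise_sd, size=(n, p))           # independent noise, n x p
    return Z @ A.T + E                                   # observations, n x p

# Hypothetical loading matrix with p = 6 variables and K = 2 factors.
# Rows 0-3 are pure variables (a single entry equal to +1 or -1);
# rows 4-5 load on both factors, so their clusters overlap.
A = np.array([
    [ 1.0,  0.0],   # pure, factor 1
    [-1.0,  0.0],   # pure, factor 1 (sign flipped)
    [ 0.0,  1.0],   # pure, factor 2
    [ 0.0,  1.0],   # pure, factor 2
    [ 0.5,  0.5],   # mixed
    [ 0.3, -0.4],   # mixed
])
C = np.array([[1.0, 0.3],
              [0.3, 1.0]])          # assumed covariance of Z
noise_sd = np.full(6, 0.5)          # assumed noise standard deviations

# Population covariance implied by the model.
Sigma = A @ C @ A.T + np.diag(noise_sd ** 2)

X = simulate_factor_model(20000, A, C, noise_sd)
Sigma_hat = np.cov(X, rowvar=False)
max_err = np.max(np.abs(Sigma_hat - Sigma))
print("max |Sigma_hat - Sigma| =", max_err)

# Two pure variables of factor 1: |Cov(X_0, X_1)| equals Var(Z_1).
print("|Sigma[0, 1]| =", abs(Sigma[0, 1]), " C[0, 0] =", C[0, 0])
```

In this toy example the largest absolute off-diagonal entry in the covariance row of a pure variable is attained at its pure sibling, which hints at why covariance patterns can reveal pure variables. The actual identifiability conditions and estimation steps are those of the paper, not this sketch.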

Updated: 2020-08-01