Sampling-based dimension reduction for subspace approximation with outliers
Theoretical Computer Science (IF 0.9), Pub Date: 2021-01-13, DOI: 10.1016/j.tcs.2021.01.021
Amit Deshpande , Rameshwar Pratap

The subspace approximation problem with outliers is the following: given $n$ points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq \alpha \leq 1$, find a $k$-dimensional linear subspace of $\mathbb{R}^d$ that minimizes the sum of squared distances to its nearest $(1-\alpha)n$ points. More generally, the $\ell_p$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case $p = 2$, i.e., robust PCA, is non-trivial, and previous work requires additional assumptions on the input or generative models for it. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-\alpha)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard, and known algorithmic results for robust subspace recovery require strong assumptions on the input, e.g., that any $d$ outliers must be linearly independent.
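To make the objective concrete, here is a minimal numerical sketch (our illustration, not code from the paper; the function name outlier_robust_cost and its interface are hypothetical) that evaluates the $\ell_p$ cost of a candidate subspace, given an orthonormal basis for it, after discarding the $\alpha n$ farthest points as outliers.

```python
import numpy as np

def outlier_robust_cost(X, V, alpha, p=2):
    """l_p cost of a candidate subspace, ignoring the alpha*n farthest points.

    X     : (n, d) array of points x_1, ..., x_n in R^d
    V     : (k, d) array whose rows are an orthonormal basis of a
            k-dimensional linear subspace of R^d
    alpha : outlier fraction in [0, 1]
    p     : power of the distances (p=2 is the squared-error / robust PCA case)
    """
    n = X.shape[0]
    proj = X @ V.T @ V                        # projection of each point onto span(V)
    dists = np.linalg.norm(X - proj, axis=1)  # distance of each point to the subspace
    m = int(np.floor((1 - alpha) * n))        # number of inliers kept
    return np.sum(np.sort(dists)[:m] ** p)    # sum over the m nearest points

# Example: cost of the top-2 PCA subspace with 10% of points treated as outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
print(outlier_robust_cost(X, Vt[:2], alpha=0.1))
```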

In this paper, we show how to extend dimension reduction techniques and bi-criteria approximations based on sampling and coresets to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared distance error of the optimal $k$-dimensional subspace, summed over the optimal $(1-\alpha)n$ inliers, is at least $\delta$ times its squared error summed over all $n$ points, for some $0 < \delta \leq 1-\alpha$. Under this assumption, we give an efficient algorithm to find a weak coreset, i.e., a subset of $\mathrm{poly}(k/\epsilon) \log(1/\delta) \log\log(1/\delta)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+\epsilon)$-approximation to the optimal solution. Our technique is based on the squared-length sampling algorithm suggested for low-rank approximation problems in the seminal work of Frieze, Kannan, and Vempala [12]. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1-\alpha$ is satisfied. We show similar results for subspace approximation with $\ell_p$ error or more general M-estimator loss functions, and also give an additive approximation for the affine subspace approximation problem.
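The basic primitive behind this approach, squared-length sampling, picks points with probability proportional to their squared Euclidean norms. The sketch below is our illustration of that primitive only; it omits the paper's adaptive analysis, bi-criteria approximation, and coreset construction, and the function name squared_length_sample is ours.

```python
import numpy as np

def squared_length_sample(X, s, seed=None):
    """Sample s row indices of X with probability proportional to ||x_i||^2.

    This is the squared-length sampling primitive of Frieze, Kannan, and
    Vempala: points with larger norms, which contribute more to the squared
    error, are more likely to be picked. The span of the sampled points
    serves as a candidate subspace from which a good k-dimensional subspace
    can then be extracted (e.g., via SVD).
    """
    rng = np.random.default_rng(seed)
    sq_norms = np.einsum('ij,ij->i', X, X)   # ||x_i||^2 for every row
    probs = sq_norms / sq_norms.sum()
    return rng.choice(X.shape[0], size=s, replace=True, p=probs)

# Example: draw a small sample and inspect which points were picked.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))
idx = squared_length_sample(X, s=25, seed=1)
print(sorted(set(idx.tolist())))
```

Note that this primitive alone is not outlier-aware; the paper's contribution is showing how to combine such sampling with the assumption $0 < \delta \leq 1-\alpha$ to obtain the weak coreset described above.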




Updated: 2021-02-01