High-dimensional model recovery from random sketched data by exploring intrinsic sparsity
Machine Learning (IF 4.3) Pub Date: 2020-01-07, DOI: 10.1007/s10994-019-05865-4
Tianbao Yang, Lijun Zhang, Qihang Lin, Shenghuo Zhu, Rong Jin

Learning from large-scale, high-dimensional data remains computationally challenging, even though it has received increasing attention recently. To address this issue, randomized reduction methods have been developed that obtain a small sketch of the original data by either reducing the dimensionality or reducing the number of training instances. In this paper, we focus on recovering a high-dimensional classification/regression model from random sketched data. We propose to exploit the intrinsic sparsity of optimal solutions and develop novel methods by increasing the regularization parameter before the sparse regularizer. In particular, (i) for high-dimensional classification problems, we leverage randomized reduction methods to reduce the dimensionality of the data and solve a dual formulation on the random sketched data with a sparse regularizer introduced on the dual solution; (ii) for high-dimensional sparse least-squares regression problems, we employ randomized reduction methods to reduce the scale of the data and solve a formulation on the random sketched data with an increased regularization parameter before the sparse regularizer. For both classes of problems, by exploiting the intrinsic sparsity of the optimal dual or primal solution, we provide formal theoretical guarantees on the recovery error of the learned models relative to the optimal models learned from the original data. Compared with previous studies on randomized reduction for machine learning, the present work enjoys several advantages: (i) the proposed formulations admit intuitive geometric explanations; (ii) the theoretical guarantees do not rely on stringent assumptions about the original data (e.g., low-rankness of the data matrix or linear separability of the data); (iii) the theory covers both smooth and non-smooth loss functions for classification; (iv) the analysis applies to a broad class of randomized reduction methods as long as the reduction matrices satisfy a Johnson–Lindenstrauss-type lemma. We also present empirical studies supporting the proposed methods and theory.
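To make the sketching pipeline of case (ii) concrete, below is a minimal, self-contained Python sketch, not the paper's exact algorithm or constants: a Gaussian sketching matrix, which satisfies a Johnson–Lindenstrauss-type lemma, reduces the number of training instances, and a lasso is then solved on the sketched data with an increased regularization parameter before the sparse regularizer. The inflation factor of 2 and the use of scikit-learn's Lasso solver are illustrative assumptions, not values prescribed by the paper.

```python
# A minimal sketch (illustrative, not the paper's exact algorithm) of
# recovering a sparse least-squares model from randomly sketched data.
# Assumptions: a Gaussian sketching matrix (JL-type) and scikit-learn's
# Lasso as the l1-regularized solver; the regularization inflation
# factor below is illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, m = 5000, 100, 500          # instances, features, sketch size (m << n)

# Synthetic data generated by a sparse ground-truth model.
w_true = np.zeros(d)
w_true[:5] = rng.normal(size=5)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

# Gaussian sketching matrix S in R^{m x n}: reduces the number of instances.
S = rng.normal(size=(m, n)) / np.sqrt(m)
X_sk, y_sk = S @ X, S @ y

lam = 0.01                        # regularization parameter on the full data
# Recovery from the sketch uses an *increased* regularization parameter
# before the sparse regularizer (the factor 2 is purely illustrative).
w_full = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
w_sk = Lasso(alpha=2 * lam, fit_intercept=False).fit(X_sk, y_sk).coef_

print("recovery error vs. full-data model:",
      np.linalg.norm(w_sk - w_full) / np.linalg.norm(w_full))
```

Under the theory described in the abstract, the sparse model recovered from the sketch should be close to the full-data solution; the printed relative error provides a quick empirical check of that behavior on this toy problem.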

Updated: 2020-01-07