Supervised compression of big data, Statistical Analysis and Data Mining

Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2021-04-08, DOI: 10.1002/sam.11508
V. Roshan Joseph, Simon Mak

The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is the use of such data for fitting statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to methods based on optimal experimental design. However, when the goal is to learn the underlying input–output relationship, such reduction methods may not be ideal, since they do not make use of information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from regions most important for modeling the desired input–output relationship. An advantage of supercompress is that it is nonparametric: the compression method does not rely on parametric modeling assumptions between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices. We demonstrate the usefulness of supercompress over existing data reduction methods, in both simulations and a taxicab predictive modeling application.
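
The contrast the abstract draws, between reduction that ignores the output and reduction that uses it, can be illustrated with a small toy sketch. This is not the supercompress algorithm, whose details the abstract does not give. As a stand-in, the second function below runs plain Lloyd's k-means on the joint (x, y) space and keeps the data point nearest each cluster centre. The function names, the `y_weight` parameter, and the toy data are all assumptions made for illustration.

```python
import math
import random


def _dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def _mean(pts):
    """Component-wise mean of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def random_subsample(points, k, seed=0):
    """Baseline: pick k (x, y) pairs uniformly at random, ignoring y."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def output_aware_subsample(points, k, y_weight=1.0, iters=20, seed=0):
    """Illustrative stand-in, NOT supercompress: k-means on (x, y_weight * y),
    then return the original data point nearest each cluster centre.
    The result may contain repeats if two centres share a nearest point."""
    rng = random.Random(seed)
    scaled = [(x, y_weight * y) for x, y in points]
    centres = rng.sample(scaled, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in scaled:
            j = min(range(k), key=lambda c: _dist2(p, centres[c]))
            clusters[j].append(p)
        # Keep the old centre when a cluster ends up empty.
        centres = [_mean(c) if c else centres[j] for j, c in enumerate(clusters)]
    chosen = []
    for c in centres:
        i = min(range(len(scaled)), key=lambda idx: _dist2(scaled[idx], c))
        chosen.append(points[i])
    return chosen


# Hypothetical toy data: the output changes sharply near x = 0.5.
data = [(i / 199, math.tanh(20 * (i / 199 - 0.5))) for i in range(200)]

baseline = random_subsample(data, 10)
output_aware = output_aware_subsample(data, 10)
print(sorted(x for x, _ in baseline))
print(sorted(x for x, _ in output_aware))
```

Because the clustering sees y as well as x, points along the steep part of the curve are far apart in the joint space, so the selection depends on where the output changes rather than on the input locations alone. How supercompress itself weights inputs against outputs is described in the paper, not here.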

Updated: 2021-05-04
