Data Twinning, Statistical Analysis and Data Mining



Data Twinning
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2022-02-15, DOI: 10.1002/sam.11574
Akhil Vakayil, V. Roshan Joseph


In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.
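To make the idea of partitioning into twin sets concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the greedy nearest-neighbour grouping, the function name `twin_sketch`, and the parameters `r` (ratio of full data to twin-set size) and `start` are assumptions made for illustration, and the brute-force distance search is quadratic in the number of points, so it does not reflect the speed the abstract describes.

```python
import math
import random


def twin_sketch(points, r, start=0):
    """Illustrative greedy partition (an assumption, not the published algorithm).

    Repeatedly takes the current point together with its r-1 nearest
    unassigned neighbours, sends the current point to the smaller twin
    set and the neighbours to the larger one, then moves on to the
    unassigned point closest to the farthest neighbour in the group.
    """
    n = len(points)
    remaining = set(range(n))
    twin = []
    current = start
    while remaining:
        # Unassigned points sorted by distance to the current point (ties by index).
        ordered = sorted(
            remaining,
            key=lambda i: (math.dist(points[i], points[current]), i),
        )
        group = ordered[:r]
        twin.append(group[0])            # closest point (the current one) joins the twin set
        remaining.difference_update(group)
        if remaining:
            far = group[-1]              # farthest member of the group just formed
            current = min(
                remaining,
                key=lambda i: (math.dist(points[i], points[far]), i),
            )
    rest = sorted(set(range(n)) - set(twin))
    return twin, rest


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
twin, rest = twin_sketch(points, r=5)
print(len(twin), len(rest))  # 40 160
```

With `r=5`, one point in five lands in the smaller set, which corresponds to a 20/80 testing/training split. Running the sketch again on `rest` would be one hypothetical way to produce further splits of the kind the abstract mentions for k-fold cross validation.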


Updated: 2022-02-15
