Data Twinning
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2022-02-15, DOI: 10.1002/sam.11574
Akhil Vakayil, V. Roshan Joseph

In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.
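As a rough illustration of what "statistically similar" can mean for two halves of a dataset, the sketch below scores a split by the energy distance between its two parts. The abstract does not say which similarity criterion Twinning uses, so the choice of energy distance, the helper names (`mean_pairwise_distance`, `energy_distance`), and the synthetic data are all assumptions made for illustration. This is not the Twinning algorithm itself, only a way to compare candidate splits.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y), x in a and y in b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(a, b):
    """Empirical energy distance between two point sets (hypothetical helper).

    Computed as 2*E|X-Y| - E|X-X'| - E|Y-Y'|; smaller values mean the two
    sets look more alike in distribution, and identical sets score 0.
    """
    return (2 * mean_pairwise_distance(a, b)
            - mean_pairwise_distance(a, a)
            - mean_pairwise_distance(b, b))


# Synthetic 2-D data, assumed purely for the demonstration.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Candidate 1: a plain random half/half split.
shuffled = data[:]
rng.shuffle(shuffled)
random_a, random_b = shuffled[:100], shuffled[100:]

# Candidate 2: a deliberately poor split (lower half vs upper half
# of the first coordinate), so the two parts cover different regions.
ordered = sorted(data)
skewed_a, skewed_b = ordered[:100], ordered[100:]

d_random = energy_distance(random_a, random_b)
d_skewed = energy_distance(skewed_a, skewed_b)
print(f"random split: {d_random:.4f}")
print(f"skewed split: {d_skewed:.4f}")
```

Under this assumed criterion, a good twin split would be one whose score is as small as possible; the skewed split scores far worse than the random one because its two parts occupy different regions of the data.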

Updated: 2022-02-15