SPlit: An Optimal Method for Data Splitting, Technometrics


SPlit: An Optimal Method for Data Splitting
Technometrics (IF 2.5), Pub Date: 2021-06-01, DOI: 10.1080/00401706.2021.1921037
V. Roshan Joseph, Akhil Vakayil


Abstract

In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
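The sequential nearest neighbor step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sequential_nn_subsample`, the toy data, and the hand-picked `rep_points` are all hypothetical, and computing actual support points (the optimization step of SP) is not shown. The sketch only assumes that a set of representative points is already available and that each one is matched, in order, to the nearest data row that has not yet been selected. Treating the selected rows as the testing set is likewise an assumption made for illustration.

```python
def sequential_nn_subsample(data, rep_points):
    """For each representative point, in order, pick the index of the
    nearest data row that has not already been picked."""
    if len(rep_points) > len(data):
        raise ValueError("need at least as many data rows as representative points")
    available = set(range(len(data)))
    chosen = []
    for p in rep_points:
        # Squared Euclidean distance to every row that is still available.
        best = min(
            available,
            key=lambda i: sum((a - b) ** 2 for a, b in zip(data[i], p)),
        )
        available.remove(best)  # sampling without replacement
        chosen.append(best)
    return chosen


# Toy dataset (hypothetical values, two continuous features per row).
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (2.0, 2.0), (2.1, 2.0), (3.0, 3.0)]

# Hypothetical representative points; in SPlit these would come from SP.
rep_points = [(0.04, 0.0), (0.0, 0.1), (2.0, 2.1)]

test_idx = sequential_nn_subsample(data, rep_points)
train_idx = [i for i in range(len(data)) if i not in test_idx]
print(test_idx, train_idx)
```

In this toy run the second representative point is closest to row 0, but row 0 has already been taken by the first point, so it falls through to row 1. That is the sense in which the matching is sequential: every selected row is distinct, so the two sets partition the data.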


Updated: 2021-06-01
