Technometrics (IF 2.5), Pub Date: 2021-06-01, DOI: 10.1080/00401706.2021.1921037
V. Roshan Joseph, Akhil Vakayil
Abstract
In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
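The subsampling step described in the abstract can be illustrated with a rough sketch. It is not the authors' implementation. The function names (`sequential_nearest_neighbors`, `split_sketch`), the 20% test fraction, and the choice to draw the test set from the representative points are assumptions made for illustration. The abstract also does not say how support points are computed, so the sketch swaps in a few Lloyd (k-means) iterations as a stand-in for them. Only the sequential nearest neighbor matching follows the idea named in the abstract: each representative point claims the closest data row that has not already been taken.

```python
import numpy as np


def sequential_nearest_neighbors(data, points):
    """Match each representative point to its nearest unused data row.

    Rows are claimed one point at a time, so no row is selected twice.
    """
    available = np.ones(len(data), dtype=bool)
    chosen = []
    for p in points:
        dist = np.linalg.norm(data - p, axis=1)
        dist[~available] = np.inf  # rows already claimed cannot be reused
        idx = int(np.argmin(dist))
        chosen.append(idx)
        available[idx] = False
    return np.array(chosen)


def split_sketch(data, test_fraction=0.2, n_iter=10, seed=0):
    """Hypothetical helper: split row indices into train and test sets.

    ASSUMPTION: k-means centers stand in for support points here; the
    published method optimizes a different criterion.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    n_test = int(round(test_fraction * n))

    # Stand-in representative points: Lloyd iterations from random rows.
    start = rng.choice(n, size=n_test, replace=False)
    centers = data[start].astype(float)
    for _ in range(n_iter):
        sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for k in range(n_test):
            members = data[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    test_idx = sequential_nearest_neighbors(data, centers)
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    return train_idx, test_idx


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(200, 2))
    train_idx, test_idx = split_sketch(data)
    print(len(train_idx), len(test_idx))
```

Because each representative point removes its match from the pool, the resulting test set contains distinct rows even when two representative points lie close together.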
Title: SPlit: An Optimal Method for Data Splitting