What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?
Computational Statistics (IF 1.3), Pub Date: 2020-06-13, DOI: 10.1007/s00180-020-00999-9
Bruce G. Marcot , Anca M. Hanea

Cross-validation using randomized subsets of data, known as k-fold cross-validation, is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (the number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence that affect validation outcomes in discrete Bayesian networks (BNs). We created six variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time at seven levels of folds (k = 2, 5, 10, 20, n − 5, n − 2, and n − 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice for BN models with independent variable structures.
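The experimental design described above, repeated cross-validation at increasing values of k on discrete data while recording error and CPU time, can be illustrated with a short sketch. The snippet below is not the authors' code: it uses scikit-learn's KFold and a categorical naive Bayes classifier (a simple special case of a discrete BN) as a stand-in for their BN models, and the synthetic four-variable data set, the n = 500 sample size, and the timing loop are illustrative assumptions only.

```python
# Minimal sketch of k-fold cross-validation at several k, assuming a
# scikit-learn stand-in classifier rather than the study's BN software.
import time

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)

n = 500                                  # one of the sample sizes studied (50, 500, 5000)
X = rng.integers(0, 3, size=(n, 4))      # four discrete predictors, three states each (illustrative)
y = (X.sum(axis=1) + rng.integers(0, 2, size=n) > 6).astype(int)  # noisy discrete target

for k in (2, 5, 10, 20, n - 5, n - 2, n - 1):
    clf = CategoricalNB(min_categories=3)         # stand-in for a discrete BN classifier
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=cv)    # per-fold classification accuracy
    elapsed = time.perf_counter() - start
    print(f"k={k:4d}  error={1 - scores.mean():.3f}  cpu_time={elapsed:.2f}s")
```

In this toy setup, the near-leave-one-out settings (k = n − 5, n − 2, n − 1) dominate the computation time, which is the cost-versus-error trade-off the study evaluates across its k levels.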




Updated: 2020-06-13