Sample-size dependence of validation parameters in linear regression models and in QSAR (SAR and QSAR in Environmental Research)



SAR and QSAR in Environmental Research (IF 3), Pub Date: 2021-03-22, DOI: 10.1080/1062936x.2021.1890208
D. Kovács, P. Király, G. Tóth


ABSTRACT

We investigated how statistical validation parameters depend on the size of the sample used to fit multivariate linear models. R² and related internal parameters were misleading because they overestimated the goodness-of-fit of models at small sample sizes. Cross-validation metrics showed correct trends. The leave-one-out and leave-many-out results could be scaled to be nearly identical by correcting the degrees of freedom of the models. Validation parameters were calculated with both y- and x-randomization, and the two methods gave nearly identical results. We suggest using the simplest methods in both cases. The external parameters followed correct trends with respect to the sample size, but their sensitivity differed. We plotted the Roy-Ojha metrics in 2D and coloured them according to other external parameters to provide an easy classification of models. Rank correlations were calculated between the performance parameters. Up to a certain sample size, goodness-of-fit and robustness were distinguishable, but above that sample size the parameters were redundant. The external-internal pairs were weakly correlated. Our data show that all three aspects of validation are necessary at small sample sizes, but the internal check of robustness is not informative above a given sample size.
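The contrast between R² and a leave-one-out cross-validated Q² at different sample sizes can be illustrated with a minimal sketch. This is not the authors' code or data: the synthetic data set, the choice of five descriptors, the noise level, and the function name `r2_and_q2_loo` are all assumptions made for illustration. Q² is taken here as 1 - PRESS/SS_tot with the full-sample mean in SS_tot, which is one common convention and may differ from the definition used in the paper.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, leave-one-out Q2) for an ordinary least-squares fit with intercept.

    Hypothetical helper for illustration only.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS coefficients
    resid = y - A @ beta                            # ordinary (fitted) residuals
    # Leverages h_ii are the diagonal of the hat matrix H = A (A^T A)^-1 A^T.
    h = np.diag(A @ np.linalg.pinv(A.T @ A) @ A.T)
    # For OLS the leave-one-out residual is e_i / (1 - h_ii), so no refitting is needed.
    loo_resid = resid / (1.0 - h)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot
    q2 = 1.0 - np.sum(loo_resid ** 2) / ss_tot      # 1 - PRESS / SS_tot
    return r2, q2

if __name__ == "__main__":
    rng = np.random.default_rng(0)                  # fixed seed, assumed for repeatability
    p = 5                                           # assumed number of descriptors
    true_beta = rng.normal(size=p)
    for n in (10, 20, 50, 200):
        X = rng.normal(size=(n, p))
        y = X @ true_beta + rng.normal(scale=2.0, size=n)
        r2, q2 = r2_and_q2_loo(X, y)
        print(f"n={n:4d}  R2={r2:.3f}  Q2_LOO={q2:.3f}")
```

Because every leave-one-out residual is at least as large in magnitude as the corresponding fitted residual, Q² can never exceed R² in this setup. The gap between the two is typically largest when the number of samples is close to the number of fitted parameters.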


Updated: 2021-03-31
