Variable screening for near infrared (NIR) spectroscopy data based on ridge partial least squares regression.
Combinatorial Chemistry & High Throughput Screening ( IF 1.8 ) Pub Date : 2020-08-31 , DOI: 10.2174/1386207323666200428114823
Naifei Zhao 1 , Qingsong Xu 2 , Man-Lai Tang 3 , Hong Wang 2

Aim and Objective: Near infrared (NIR) spectroscopy data are characterized by a few dozen to many thousands of samples and by highly correlated variables. Quantitative analysis of such data usually requires combining analytical methods with variable selection or screening methods. Commonly used variable screening methods fail to recover the true model when (i) some of the variables are highly correlated, and (ii) the sample size is smaller than the number of relevant variables. In these cases, approaches based on Partial Least Squares (PLS) regression can be useful alternatives.

Materials and Methods: In this research, a fast variable screening strategy, namely preconditioned screening for ridge partial least squares regression (PSRPLS), is proposed for modelling NIR spectroscopy data with high-dimensional and highly correlated covariates. Under rather mild assumptions, we prove that, by using the Puffer transformation, the proposed approach transforms the problem of variable screening with highly correlated predictor variables into one with weakly correlated covariates, at little extra computational cost.
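To make the preconditioning step concrete, the following is a minimal sketch of the Puffer transformation as it is usually defined in the preconditioning literature: given the thin SVD X = U D V^T, form F = U D^{-1} U^T and replace (X, y) by (FX, Fy). The abstract does not spell out the full PSRPLS procedure, so the function name `puffer_transform`, the simulated data, and the noise levels are illustrative assumptions, and the ridge PLS screening step itself is not shown.

```python
import numpy as np

def puffer_transform(X, y):
    """Precondition (X, y) with F = U diag(1/d) U^T, where X = U diag(d) V^T (thin SVD).

    Assumes X has full row rank (n <= p), so every singular value is nonzero.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    F = U @ np.diag(1.0 / d) @ U.T
    return F @ X, F @ y

# Hypothetical data: n < p, columns made highly correlated by a shared latent factor.
rng = np.random.default_rng(0)
n, p = 20, 50
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only the first three variables are relevant
y = X @ beta + 0.01 * rng.normal(size=n)

X_t, y_t = puffer_transform(X, y)

# After the transformation F X = U V^T, so the rows of X_t are orthonormal.
row_gram = X_t @ X_t.T
```

In this sketch the only extra cost is one thin SVD of X, and any screening or regression step would then be run on `X_t` and `y_t` in place of the original data.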

Results: We show that the proposed method yields theoretically consistent model selection. Four simulation studies and two real examples are then analyzed to illustrate the effectiveness of the proposed approach.

Conclusion: By introducing the Puffer transformation, the PSRPLS procedure we construct mitigates the high-correlation problem. Employing RPLS regression makes the approach simpler and more computationally efficient when the model size is larger than the sample size, while maintaining high prediction precision.




Updated: 2020-11-02