Variable selection of spectroscopic data through monitoring both location and dispersion of PLS loading weights
Journal of the Korean Statistical Society (IF 0.6) Pub Date: 2021-01-19, DOI: 10.1007/s42952-020-00098-x
Tahir Mehmood, Arslan Munir Turk

High-dimensional data sets paired with small sample sizes are central to most of the sciences. Variable selection contributes to better prediction of real-life phenomena. A multivariate approach called partial least squares (PLS) has the potential to model high-dimensional data, where the sample size is usually smaller than the number of variables. Truncation for variable selection in PLS (\(T-PLS\)) is considered a reference method. \(T-PLS\) and many other methods monitor only the location of the PLS loading weights for variable selection. In this article, we propose monitoring both the location and the dispersion of the PLS loading weights for variable selection in high-dimensional spectral data. The proposed PLS variants are based on monitoring the location, the dispersion, both location and dispersion, and at least one of location or dispersion of the PLS loading weights, and are denoted by \(X-PLS\), \(S-PLS\), \(X \& S-PLS\), and \(X|S-PLS\), respectively. The proposed variants are compared with standard PLS and \(T-PLS\) through a Monte Carlo simulation of 100 runs on simulated data and on real data sets covering the prediction of corn, milk, and oil contents from spectroscopic data. \(X \& S-PLS\) shows the best capability in selecting the real variables in the simulated data. The validated RMSE comparison indicates that \(X|S-PLS\) and \(X \& S-PLS\) outperform the other methods in predicting corn, milk, and oil contents. \(X \& S-PLS\) selects the smallest number of variables. Interestingly, the variables selected by \(X \& S-PLS\) are more consistent than those selected by all other methods. Hence, \(X \& S-PLS\) appears to be a potential candidate for variable selection in high-dimensional data.
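
The abstract does not state the actual selection rules, so the following is only a minimal sketch of how the four variants could relate to one another, not the authors' procedure. It assumes a hypothetical matrix `weights` holding one row of PLS loading weights per resample (or component), takes the mean absolute weight as the location statistic and the sample standard deviation as the dispersion statistic, and flags a variable when a statistic exceeds a cutoff. The function name `select_variables`, the cutoffs `loc_cut` and `disp_cut`, the direction of both comparisons, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def select_variables(weights, loc_cut, disp_cut):
    """Combine a location rule and a dispersion rule on loading weights.

    weights: list of rows, one row of loading weights per resample or
             component (hypothetical input layout).
    loc_cut, disp_cut: illustrative cutoffs, not taken from the paper.
    Returns a dict mapping each variant name to sorted variable indices.
    """
    n_vars = len(weights[0])
    columns = [[row[j] for row in weights] for j in range(n_vars)]

    # Location statistic (assumed): mean absolute loading weight.
    location = [mean(abs(w) for w in col) for col in columns]
    # Dispersion statistic (assumed): sample standard deviation.
    dispersion = [stdev(col) for col in columns]

    by_location = {j for j in range(n_vars) if location[j] > loc_cut}
    by_dispersion = {j for j in range(n_vars) if dispersion[j] > disp_cut}

    return {
        "X-PLS": sorted(by_location),                    # location only
        "S-PLS": sorted(by_dispersion),                  # dispersion only
        "X&S-PLS": sorted(by_location & by_dispersion),  # both rules
        "X|S-PLS": sorted(by_location | by_dispersion),  # at least one rule
    }


# Toy loading weights: 3 resamples, 4 variables (made-up numbers).
weights = [
    [0.9, 0.1, 0.5, 0.3],
    [0.7, 0.1, -0.5, -0.3],
    [0.8, 0.1, 0.5, 0.0],
]
result = select_variables(weights, loc_cut=0.3, disp_cut=0.2)
for name, selected in result.items():
    print(name, selected)
```

Whatever statistics and cutoffs are used, the "both" rule is an intersection and the "at least one" rule is a union, so \(X \& S-PLS\) can never select more variables than \(X-PLS\) or \(S-PLS\) alone. This is consistent with the abstract's report that \(X \& S-PLS\) selects the smallest number of variables.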




Updated: 2021-01-19