Abstract
Handling high-dimensional data sets with small sample sizes is essential in most sciences. Variable selection contributes to better prediction of real-life phenomena. Partial least squares (PLS), a multivariate approach, can model high-dimensional data in which the sample size is usually smaller than the number of variables. Truncation for variable selection in PLS, \(T-PLS\), is considered a reference method. \(T-PLS\) and many other methods monitor only the location of the PLS loading weights for variable selection. In this article, we propose monitoring both the location and the dispersion of the PLS loading weights for variable selection in high-dimensional spectral data. The proposed PLS variants are based on monitoring the location, the dispersion, both location and dispersion, and at least one of location or dispersion of the PLS loading weights, and are denoted by \(X-PLS\), \(S-PLS\), \(X \& S-PLS\) and \(X|S-PLS\), respectively. The proposed variants are compared with standard PLS and \(T-PLS\) through a Monte Carlo simulation of 100 runs on simulated data and on real data sets, which cover the prediction of corn, milk, and oil contents from spectroscopic data. \(X \& S-PLS\) shows the best capability in selecting the real variables in the simulated data. A comparison of validated RMSE indicates that \(X|S-PLS\) and \(X \& S-PLS\) outperform the other methods in predicting corn, milk, and oil contents. \(X \& S-PLS\) selects the smallest number of variables. Interestingly, the variables selected by \(X \& S-PLS\) are more consistent than those selected by all other methods. Hence \(X \& S-PLS\) appears to be a potential candidate for variable selection in high-dimensional data.
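To make the idea of combining location and dispersion monitoring concrete, the following is a minimal sketch in Python and not the authors' implementation. The abstract does not give the monitoring rule, so several parts are assumptions made purely for illustration: the loading weights come from the first PLS component only, their variability is obtained by bootstrap resampling, and the control limit (`upper_limit`, mean plus `k` standard deviations across variables) is hypothetical. Only the way the four variants combine the two flags (location, dispersion, both, at least one) follows the description above.

```python
import numpy as np


def first_loading_weights(X, y):
    """Unit-norm loading weights of the first PLS1 component (w proportional to X'y)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)


def upper_limit(stat, k):
    """Hypothetical control limit: mean + k * standard deviation across variables."""
    return stat.mean() + k * stat.std(ddof=1)


def select_variants(X, y, n_resamples=100, k=1.0, seed=0):
    """Return boolean selection masks for the four variants (illustrative rule only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.empty((n_resamples, p))
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # bootstrap resample of the samples (assumption)
        W[b] = first_loading_weights(X[idx], y[idx])

    location = np.abs(W.mean(axis=0))      # per-variable location of the weights
    dispersion = W.std(axis=0, ddof=1)     # per-variable dispersion of the weights

    loc_flag = location > upper_limit(location, k)
    disp_flag = dispersion > upper_limit(dispersion, k)
    return {
        "X-PLS": loc_flag,                 # location only
        "S-PLS": disp_flag,                # dispersion only
        "X&S-PLS": loc_flag & disp_flag,   # both location and dispersion
        "X|S-PLS": loc_flag | disp_flag,   # at least one of the two
    }


# Usage on synthetic data: 30 samples, 200 variables, 5 of them relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=30)
masks = select_variants(X, y)
```

By construction, the variables selected under the "both" rule are a subset of those selected under either single rule, which in turn are subsets of the "at least one" rule. This is consistent with the abstract's remark that \(X \& S-PLS\) selects the smallest number of variables.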
Cite this article
Mehmood, T., Turk, A.M. Variable selection of spectroscopic data through monitoring both location and dispersion of PLS loading weights. J. Korean Stat. Soc. 50, 905–917 (2021). https://doi.org/10.1007/s42952-020-00098-x
DOI: https://doi.org/10.1007/s42952-020-00098-x