Abstract
Partial least squares (PLS) regression is a multivariate method in which the model parameters are estimated using either the SIMPLS or the NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Notably, the NIPALS algorithm can estimate the parameters from incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue, and several approaches have been proposed in the literature, including the Q2 criterion and the AIC and BIC criteria. How to deal with missing data when using PLS regression, however, remains a matter of debate. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components of a PLS regression on incomplete data sets and on data sets imputed with one of three methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested the criteria with proportions of missing data ranging from 5% to 50% under different missingness assumptions. Q2 leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.
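To illustrate how NIPALS can estimate parameters from incomplete data, here is a minimal sketch in Python. It is not the implementation used in the study. It assumes a single, fully observed response (PLS1), missing predictor values coded as NaN, and the common available-case variant in which every regression sum simply skips the missing cells. The function name `nipals_pls1` and its return values are hypothetical, chosen only for this example.

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """Sketch of PLS1 by NIPALS; NaN cells of X are skipped in every sum.

    Assumes y is complete and that no row or column of X is entirely missing.
    Returns scores T, weights W, loadings P and y-coefficients c.
    """
    X = np.array(X, dtype=float)
    y = np.array(y, dtype=float)
    observed = ~np.isnan(X)                 # True where x_ij is observed
    M = observed.astype(float)              # same mask as 0/1 weights

    # Center each column with its available-case mean, center y as usual.
    X = X - np.nanmean(X, axis=0)
    f = y - y.mean()
    E = np.where(observed, X, 0.0)          # residuals; missing cells held at 0

    n, p = E.shape
    T = np.zeros((n, n_components))
    W = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    c = np.zeros(n_components)

    for h in range(n_components):
        # Weights: slope of each column of E on f, over observed rows only.
        w = (E.T @ f) / (M.T @ (f ** 2))
        w /= np.linalg.norm(w)
        # Scores: slope of each row of E on w, over observed columns only.
        t = (E @ w) / (M @ (w ** 2))
        # Loadings: slope of each column of E on t, over observed rows only.
        load = (E.T @ t) / (M.T @ (t ** 2))
        # Deflate observed cells; missing cells stay at 0 and never contribute.
        E = np.where(observed, E - np.outer(t, load), 0.0)
        c[h] = (f @ t) / (t @ t)
        f = f - c[h] * t
        T[:, h], W[:, h], P[:, h] = t, w, load

    return T, W, P, c
```

On complete data the mask is all ones and the loop reduces to the classical NIPALS PLS1 steps. With missing cells, each slope is computed from the observed entries only, which is why the algorithm needs no prior imputation.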
© 2019 Walter de Gruyter GmbH, Berlin/Boston