Determining the number of components in PLS regression on incomplete data set

Titin Agustin Nengsih; Frédéric Bertrand; Myriam Maumy-Bertrand; Nicolas Meyer

doi:10.1515/sagmb-2018-0059

Published by De Gruyter November 6, 2019

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Partial least squares regression, or PLS regression, is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches for selecting the number of components have been proposed in the literature, including the Q² criterion and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components for a PLS regression on an incomplete data set and on imputed data sets obtained with three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested the criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q²-leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.

Keywords: imputation method; missing data; NIPALS; number of components; PLS regression

MSC 2010: 62G08; 68U20; 65C60
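
To make the abstract's two central ideas concrete, the following is a minimal Python sketch, not the authors' implementation, of (i) a single-response PLS fit by NIPALS in which missing cells of the predictor matrix are simply skipped in each inner product, and (ii) a leave-one-out Q² computed for each number of components. The function names `nipals_pls1`, `predict` and `q2_loo`, the centering-only preprocessing, the fully observed response, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def _split(E):
    """Return (observed-cell mask as floats, E with NaN replaced by 0)."""
    M = (~np.isnan(E)).astype(float)
    return M, np.where(M > 0, E, 0.0)

def nipals_pls1(X, y, n_comp):
    """PLS1 by NIPALS; NaN cells in X are skipped in every inner product."""
    x_mean, y_mean = np.nanmean(X, axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centered data; NaNs stay NaN
    W, P, C = [], [], []
    for _ in range(n_comp):
        M, E0 = _split(E)
        w = (E0.T @ f) / (M.T @ f**2)      # slope of each column on f (observed rows)
        w /= np.linalg.norm(w)
        t = (E0 @ w) / (M @ w**2)          # slope of each row on w (observed columns)
        p = (E0.T @ t) / (M.T @ t**2)      # slope of each column on t (observed rows)
        c = (f @ t) / (t @ t)
        E = E - np.outer(t, p)             # deflate X and y
        f = f - c * t
        W.append(w); P.append(p); C.append(c)
    return {"x_mean": x_mean, "y_mean": y_mean,
            "W": np.array(W), "P": np.array(P), "C": np.array(C)}

def predict(model, X, n_comp):
    """Predict y from the first n_comp components; X may contain NaN."""
    E = X - model["x_mean"]
    yhat = np.full(X.shape[0], model["y_mean"])
    for h in range(n_comp):
        M, E0 = _split(E)
        w, p = model["W"][h], model["P"][h]
        t = (E0 @ w) / (M @ w**2)
        yhat = yhat + model["C"][h] * t
        E = E - np.outer(t, p)
    return yhat

def q2_loo(X, y, max_comp):
    """Q2_h = 1 - PRESS_h / RSS_(h-1), with PRESS from leave-one-out refits."""
    n = len(y)
    full = nipals_pls1(X, y, max_comp)
    # rss[k] is the residual sum of squares of the k-component model, k = 0..max_comp-1
    rss = np.array([np.sum((y - predict(full, X, k))**2) for k in range(max_comp)])
    press = np.zeros(max_comp)
    for i in range(n):
        keep = np.arange(n) != i
        m = nipals_pls1(X[keep], y[keep], max_comp)
        for h in range(1, max_comp + 1):
            press[h - 1] += (y[i] - predict(m, X[i:i + 1], h)[0])**2
    return 1.0 - press / rss

# Simulated example (an assumption): one latent factor, about 10% of cells missing at random.
rng = np.random.default_rng(0)
n, loadings = 40, np.array([1.0, -1.0, 0.5, 0.8, -0.6, 1.2])
latent = rng.normal(size=n)
X = np.outer(latent, loadings) + 0.2 * rng.normal(size=(n, loadings.size))
y = 2.0 * latent + 0.2 * rng.normal(size=n)
missing = rng.random(X.shape) < 0.10
missing[:, 0] = False                      # keep one column complete so every row has data
X[missing] = np.nan
q2 = q2_loo(X, y, max_comp=3)
print(np.round(q2, 3))
```

In this sketch, a component would be retained while its Q² stays above a chosen threshold (0.0975 is a commonly cited value in the PLS literature). The sketch centers but does not scale the predictors and refits the whole model for each left-out row, which is a simplification of the cross-validation schemes used in dedicated PLS software.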

Published Online: 2019-11-06