Abstract
A statistical method is likely to be sub-optimal if the assumed model does not reflect the structure of the data at hand. For this reason, it is important to perform model selection before statistical analysis. However, selecting an appropriate model from a large candidate pool is usually computationally infeasible when faced with a massive data set, and little work has been done to study data selection for model selection. In this work, we propose a subdata selection method based on leverage scores which enables us to conduct the selection task on a small subdata set. Compared with existing subsampling methods, our method not only improves the probability of selecting the best model but also enhances the estimation efficiency. We justify this both theoretically and numerically. Several examples are presented to illustrate the proposed method.
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Atkinson AC, Fedorov VV (1975) The design of experiments for discriminating between two rival models. Biometrika 62:57–70
Bingham DR, Chipman HA (2007) Incorporating prior information in optimal design for model selection. Technometrics 49:155–163
Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194
Box GEP, Hill WJ (1967) Discrimination among mechanistic models. Technometrics 9:57–71
Candes E, Tao T et al (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35:2313–2351
Chakrabortty A, Cai T (2018) Efficient and adaptive linear regression in semi-supervised settings. Ann Stat 46:1541–1572
Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, vol 80, pp 844–853
Chipman HA, Hamada MS (1996) Discussion: factor-based or effect-based modeling? implications for design. Technometrics 38:317–320
Claeskens G, Hjort NL (2008) Model selection and model averaging. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge
Consonni G, Deldossi L (2016) Objective Bayesian model discrimination in follow-up experimental designs. TEST 25:397–412
Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol. In press
Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39
Dette H, Titoff S (2009) Optimal discrimination designs. Ann Stat 37:2056–2082
Dette H, Melas VB, Guchenko R (2015) Bayesian T-optimal discriminating designs. Ann Stat 43:1959–1985
Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157
Drineas P, Mahoney MW, Muthukrishnan S, Sarlós T (2011) Faster least squares approximation. Numerische Mathematik 117:219–249
Drovandi CC, McGree JM, Pettitt AN (2014) A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. J Comput Gr Stat 23:3–24
Efron B, Hastie T, Johnstone I, Tibshirani R et al (2004) Least angle regression. Ann Stat 32:407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on Statistics and Applied Probability. Springer, Berlin
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc: Ser B 55:757–779
Hastie TJ, Tibshirani RJ (1990) Generalized additive models, vol 43. CRC Press, Boca Raton
Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
Kadane JB, Lazar NA (2004) Methods and criteria for model selection. J Am Stat Assoc 99:279–290
Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2015) A scalable bootstrap for massive data. J R Stat Soc: Ser B 76:795–816
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80
Leng C, Leung DHY (2011) Model selection in validation sampling: an asymptotic likelihood-based lasso approach. Stat Sin 21:659–678
Li T, Meng C (2021) Modern subsampling methods for large-scale least squares regression. arXiv preprint arXiv:2105.01552
Lindley DV (1956) On a measure of the information provided by an experiment. Ann Math Stat 27:986–1005
López-Fidalgo J, Tommasi C, Trandafir PC (2007) An optimal experimental design criterion for discriminating between non-normal models. J R Stat Soc: Ser B 69:231–242
Ma P, Mahoney MW, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919
Ma P, Zhang X, Xing X, Ma J, Mahoney MW (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. arXiv preprint arXiv:2002.10526
Mahoney MW (2012) Randomized algorithms for matrices and data. Found Trends Mach Learn 3:647–672
Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
Mamonov S, Triantoro T (2018) Subjectivity of diamond prices in online retail: insights from a data mining study. J Theor Appl Electron Commer Res 13:15–28
McCullagh P, Nelder JA (1989) Generalized linear models. Monographs on Statistics and Applied Probability, vol 37. Chapman & Hall
Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over-or underdetermined systems. SIAM J Sci Comput 36:C95–C118
Meng C, Wang Y, Zhang X, Mandal A, Ma P, Zhong W (2017) Effective statistical methods for big data analytics. In: Handbook of Research on Applied Cybernetics and Systems Science, pp 280–299
Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2020a) Lowcon: a design-based subsampling approach in a misspecified linear model. J Comput Gr Stat. In press
Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020b) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735
Meyer RD, Steinberg DM, Box G (1996) Follow-up designs to resolve confounding in multifactor experiments. Technometrics 38:303–313
Miller A (2002) Subset selection in regression. CRC Press, Boca Raton
Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep., National Bureau of Economic Research
Papailiopoulos D, Kyrillidis A, Boutsidis C (2014) Provable deterministic leverage score sampling. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 997–1006
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Sebastiani P, Wynn HP (2000) Maximum entropy sampling and optimal Bayesian experimental design. J R Stat Soc: Ser B 62:145–157
Shao J (1997) An asymptotic theory for linear model selection. Stat Sin 7:221–264
Shewry MC, Wynn HP (1987) Maximum entropy sampling. J Appl Stat 14:165–170
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39:1–13
Sin CY, White H (1996) Information criteria for selecting possibly misspecified parametric models. J Econom 71:207–225
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B 58:267–288
Truong Y, Kooperberg C, Stone C, Hansen M (2005) Statistical modeling with spline functions: methodology and theory. Springer Series in Statistics, Springer, New York
van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405
Xu C, Chen J, Mantel H (2013) Pseudo-likelihood-based Bayesian information criterion for variable selection in survey data. Surv Methodol 39:303–321
Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika 92:937–950
Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:585–599
Yao Y, Wang H (2021) A selective review on statistical techniques for big data. In: Modern statistical methods for health research. Springer. In press
Yuan Z, Yang Y (2005) Combining linear regression models: when and how? J Am Stat Assoc 100:1202–1214
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang T, Ning Y, Ruppert D (2020) Optimal sampling for generalized linear models under measurement constraints. J Comput Gr Stat. In press
Zheng C, Ferrari D, Yang Y (2019) Model selection confidence sets by likelihood ratio testing. Stat Sin 29:827–851
Acknowledgements
The authors sincerely thank the editor, associate editor, and referees for their valuable comments and insightful suggestions, which led to further improvement of this article. The authors are also grateful to professors Mingyao Ai and Ping Ma for helpful discussions. This work is supported by NSFC (Grant No. 12001042) and Beijing Institute of Technology Research Fund Program for Young Scholars and also supported by National Science Foundation (Grant No. 2105571).
Appendix
Technical details
Proof of Theorem 1
For any given subdata \(X^*\), by applying the entropy decomposition in information theory (Sebastiani and Wynn 2000, Equation (2)), the joint entropy of \(Y^*\) and the parameter \(\varTheta \) can be decomposed as
where \(\varvec{\varepsilon }^*\) denotes the error term of \((Y^*,X^*)\) in model (1). The second equality holds by the model assumption: conditional on \(\varTheta \) and \(X^*\), all the randomness of \(Y^*\) comes from the error term, and \(\varTheta \) is functionally independent of \(X^*\). This implies that \(\mathrm {Ent}(Y^*,\varTheta |X^*)\) is a constant that depends on the subdata only through its size r.
Note that \(\mathrm {Ent}(Y^*,\varTheta |X^*)\) can also be decomposed as
That is, because the joint entropy is constant, maximizing \(\mathrm {Ent}(Y^*|X^*)\) is equivalent to minimizing the overall expected deviance loss \(E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\}\).
We now turn to calculating \(\mathrm {Ent}(Y^*|X^*)\). Without loss of generality, assume that the first \(p_{(t)}\) columns of X form the model matrix of the true model. The prior of \(\beta _{(t)}\) is then \(N(\beta _\mathrm{prior,(t)},\sigma _f^2I_{p_{(t)}})\), where \(\beta _\mathrm{prior,(t)}\) consists of the first \(p_{(t)}\) entries of \(\beta _\mathrm{prior}\). Since \(Y^*=X_{(t)}^*\beta _{(t)}+\varvec{\varepsilon }^*\), the marginal distribution of \(Y^*\) is normal with mean \(X_{(t)}^*\beta _\mathrm{prior,(t)}\) and covariance matrix \(\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }}\) under model (1). The desired result follows from the facts
where \(c_1,c_2,c_3\) are constants that do not depend on the selected subdata except through its size r. The second equality follows from the matrix determinant lemma, i.e., \(\det (A+BC)=\det (A)\det (I+CA^{-1}B)\) for matrices A, B, C of conformable dimensions with \(A>0\). \(\square \)
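The displayed equations of this proof are not reproduced above. As a hedged sketch rather than the published displays, the steps can be written out as follows, assuming normal errors \(\varvec{\varepsilon }^*\sim N(0,\sigma ^2I_r)\), conditioning on the true model, and writing the covariance of \(Y^*\) as the \(r\times r\) matrix \(\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }}\). The exact form, constants, and numbering in the original may differ.

\[
\mathrm {Ent}(Y^*,\varTheta |X^*)=\mathrm {Ent}(\varTheta |X^*)+E_{\varTheta }\{\mathrm {Ent}(Y^*|\varTheta ,X^*)\}=\mathrm {Ent}(\varTheta )+\mathrm {Ent}(\varvec{\varepsilon }^*),
\]
\[
\mathrm {Ent}(Y^*,\varTheta |X^*)=\mathrm {Ent}(Y^*|X^*)+E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\},
\]
\[
\mathrm {Ent}(Y^*|X^*)=\frac{r}{2}\log (2\pi e)+\frac{1}{2}\log \det \left( \sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }}\right) =\frac{r}{2}\log (2\pi e\sigma ^2)+\frac{1}{2}\log \det \left( I_{p_{(t)}}+\frac{\sigma _f^{2}}{\sigma ^2}X_{(t)}^{*{\mathrm {T} }}X_{(t)}^{*}\right) .
\]

The last equality applies the matrix determinant lemma with \(A=\sigma ^2I_r\), \(B=\sigma _f^{2}X_{(t)}^{*}\) and \(C=X_{(t)}^{*{\mathrm {T} }}\). Because the left-hand side of the second line does not depend on which rows are selected, a larger determinant in the third line corresponds to a smaller expected posterior entropy of \(\varTheta \).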
Proof of Theorem 2
It suffices to show that \(X^{*{\mathrm {T} }}X^*\le (\sum _{i=1}^n\delta _i h_{ii}) X^{{\mathrm {T} }}X\) in the sense of Loewner ordering. Let \({{x}}_i\) be the ith row of X. For any \( a\in \mathbb {R}^{p}\), since \(X^{\mathrm {T} }X\) is a full rank matrix, a can be represented as \( a=(X^{\mathrm {T} }X)^{-1/2} b\) for some \( b\in \mathbb {R}^{p}\). Then,
where \(\mathrm {tr}(\cdot )\) is the trace operator and \((X^{\mathrm {T} }X)^{-1/2}(X^{\mathrm {T} }X)^{-1/2}=(X^{\mathrm {T} }X)^{-1}\). Therefore, \( {{x}}_i {{x}}_i^{\mathrm {T} }\le h_{ii}X^{\mathrm {T} }X\), and the desired result follows by summing both sides of this inequality over the selected points, i.e., those with \(\delta _i=1\). \(\square \)
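The inequality in Theorem 2 can be checked numerically. The following is a minimal sketch, not the authors' code: it uses only the Python standard library, simulated data, and an arbitrary random subset in place of the indicators \(\delta _i\). The helper names (`solve`, `gram`, `quad`) and all parameter values are illustrative assumptions. Leverage scores are computed as \(h_{ii}=x_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1}x_i\), and the Loewner ordering is probed through quadratic forms in random directions, which is a sanity check rather than a proof.

```python
import random

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gram(rows):
    """Return the p x p matrix sum_i x_i x_i^T over the given rows."""
    p = len(rows[0])
    return [[sum(x[j] * x[k] for x in rows) for k in range(p)] for j in range(p)]

def quad(M, a):
    """Quadratic form a^T M a."""
    return sum(a[j] * M[j][k] * a[k] for j in range(len(a)) for k in range(len(a)))

random.seed(0)
n, p, r = 50, 3, 10  # illustrative sizes, not taken from the paper
X = [[1.0] + [random.gauss(0, 1) for _ in range(p - 1)] for _ in range(n)]
G = gram(X)  # X^T X

# Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i
h = [sum(xj * zj for xj, zj in zip(x, solve(G, x))) for x in X]

# Any subset works for the bound; here delta_i = 1 for r random rows
subset = random.sample(range(n), r)
bound = sum(h[i] for i in subset)
G_sub = gram([X[i] for i in subset])  # X*^T X*

# a^T (bound * X^T X - X*^T X*) a should be nonnegative for every a
directions = [[random.gauss(0, 1) for _ in range(p)] for _ in range(1000)]
min_gap = min(bound * quad(G, a) - quad(G_sub, a) for a in directions)
```

Here `min_gap` should be nonnegative up to floating-point error, and the leverage scores should sum to p, the trace of the hat matrix.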
For clarity, we first prove the following lemma, some results of which will be used in the proof of Theorem 3.
Lemma 1
Assume that \(n^{-1}X^TX\) converges to a positive definite matrix. Let \(\hat{\beta }_k^*\) be the MLE for the kth candidate model based on the subdata set selected according to Algorithm 1. As \(r\rightarrow \infty ,n\rightarrow \infty \), the following result holds:
Proof of Lemma 1
According to Algorithm 1, \(X^*=U_\varGamma \varSigma V^{\mathrm {T} }\). Then it is sufficient to show that
for some constant c, where \(\lambda _{\max }(A)\) and \(\lambda _{\min }(A)\) denote the maximum and minimum eigenvalues of A, respectively. Since \(U_\varGamma ^{\mathrm {T} }U_\varGamma \) is positive definite by Algorithm 1, \(\lambda _{\max } (U_\varGamma ^{\mathrm {T} }U_\varGamma )\le \mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )=\sum _{i=1}^{r}h_{(ii)}\). By the definition of the condition number, it holds that
where the last inequality comes from the fact that \(\mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )\le p\lambda _{\max }(U_\varGamma ^{\mathrm {T} }U_\varGamma )\).
From (A.7), it follows that
and the desired result follows by noting that \(X^TX=V\varSigma ^2 V^T\) and \(\mathrm{Var}(\hat{\beta }_k)=(P^TX^{*T}X^*P)^{-1}\) for the projection matrix P such that \(X_{(k)}=XP\), where \(X_{(k)}\) is the design matrix for model \(S_k\). \(\square \)
We now turn to the proof of Theorem 3.
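The identity \(\mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )=\sum _{i=1}^{r}h_{(ii)}\) used in the proof of Lemma 1 can be illustrated with a small sketch. Algorithm 1 is not restated in this appendix, so the code below assumes, purely for illustration, that it keeps the r rows with the largest leverage scores. To stay within the standard library it avoids an explicit SVD and uses the equivalent form \(\mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )=\mathrm {tr}\{(X^{\mathrm {T} }X)^{-1}X^{*{\mathrm {T} }}X^*\}\), which follows from \(U=XV\varSigma ^{-1}\). All names and sizes are hypothetical.

```python
import random

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gram(rows):
    """Return the p x p matrix sum_i x_i x_i^T over the given rows."""
    p = len(rows[0])
    return [[sum(x[j] * x[k] for x in rows) for k in range(p)] for j in range(p)]

random.seed(1)
n, p, r = 200, 3, 20  # illustrative sizes, not taken from the paper
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
G = gram(X)  # X^T X

# Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i
h = [sum(xj * zj for xj, zj in zip(x, solve(G, x))) for x in X]

# Assumed selection rule: keep the r rows with the largest leverage scores
gamma = sorted(range(n), key=lambda i: h[i], reverse=True)[:r]
G_sub = gram([X[i] for i in gamma])  # X*^T X*

# tr((X^T X)^{-1} X*^T X*), computed one column of X*^T X* at a time
trace = sum(solve(G, [G_sub[j][k] for j in range(p)])[k] for k in range(p))
top_sum = sum(h[i] for i in gamma)  # sum of the r largest leverage scores
```

The two quantities `trace` and `top_sum` should agree up to rounding. Since all n leverage scores sum to p, `top_sum` lies between r·p/n (the value expected under uniform selection) and p.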
Proof of Theorem 3
Let \(\mathcal {M}^C\) denote the set of correct candidate models, and let \(\mathcal {M}^I=\mathcal {M}-\mathcal {M}^C\) denote the set of incorrect candidate models. We first show that
where \(\mu ^*\) stands for the mean of the selected data, \(H_{(k)}^*=X_{(k)}^*(X_{(k)}^{*{\mathrm {T} }}X_{(k)}^*)^{-1}X_{(k)}^{*{\mathrm {T} }}.\)
For any candidate model in \(\mathcal {M}^{I}\), say \(S_k\) as an example, let the model matrix for the closest correct model be \(\tilde{X}_{(k)}^*:=(X_{(\check{c})}^*,X_{(k)}^*)\). Here \(X^*_{(\check{c})}\) stands for the “complementary” design, which consists of the columns of \(X_{(t)}^*\) that are not included in \(X_{(k)}^*\). Denote the regression coefficient vector corresponding to \(X_{(\check{c})}^*\) as \(\beta _{(\check{c})}\), which is a subvector of \(\beta _{(t)}\). Direct calculation yields
Utilizing the results in (A.8), we have
for some constant \(c_2\). Note that \(\tilde{X}_{(k)}\) is a submatrix of X up to a column permutation. Thus \(\lambda _{\min }(\tilde{X}_{(k)}^{*{\mathrm {T} }}\tilde{X}^*_{(k)})\ge \lambda _{\min }(\tilde{X}^{*{\mathrm {T} }}\tilde{X}^*)=O\left( n\sum _{i=1}^{r}h_{(ii)}\right) \), where \(\lambda _{\min }(\cdot )\) denotes the smallest eigenvalue of a square matrix. From (A.10), we have
which implies (A.9) holds.
For convenience, let \({\text {BIC}^*}(S_k)=r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2\right) +(p_{(k)}+1)\log r.\) From (3.7) in Shao (1997), for any model \(S_k\) in \(\mathcal {M}^I\), we have
which implies \(\log (\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2)-\log (\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2)>r\log (1+p\log r/r)\) under the assumption \(\hat{\sigma }^*\not \rightarrow 0.\) Therefore,
Similarly, for any model \(S_{k^\prime }\) in \(\mathcal {M}^C\) with \(p_{(k^\prime )}>p_{(t)},\) where \(p_{(t)}\) is the column dimension of \(X_{(t)}\), it is straightforward to see that
according to the log likelihood ratio test (see van der Vaart 1998, Chapter 16). Therefore, it holds that
Combining (A.14) and (A.16), we can get the desired result. \(\square \)
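The criterion \(\text {BIC}^*(S_k)\) defined in the proof above is straightforward to evaluate on a subdata set. The sketch below is an illustration only: the function name `bic_star` and the toy responses and fitted values are assumptions, not material from the paper.

```python
import math

def bic_star(y, mu_hat, p_k):
    """BIC*(S_k) = r * log(RSS / r) + (p_k + 1) * log(r) for subdata of size r.

    y      : responses y_i^* in the subdata
    mu_hat : fitted means under candidate model S_k
    p_k    : number of regression coefficients in S_k
    """
    r = len(y)
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu_hat))
    return r * math.log(rss / r) + (p_k + 1) * math.log(r)

# Toy example with made-up numbers: r = 4 and residual sum of squares 1.0
y = [1.0, 2.0, 3.0, 4.0]
mu_hat = [1.5, 1.5, 3.5, 3.5]
value = bic_star(y, mu_hat, p_k=1)
```

With the fit held fixed, each additional coefficient raises the criterion by \(\log r\), which is the penalty that the comparison between correct models of different sizes relies on.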
Proof of Theorem 4
This is a direct consequence of Lemma 1.
Additional simulation results on forward regression
Since the number of possible models increases exponentially with p, all-subset regression is feasible only when p is relatively small. A forward selection approach is usually adopted as an alternative. More precisely, forward regression starts from the null model and, at each step, adds to the currently “best” model the one variable that yields the lowest BIC value. This process is repeated until adding any further variable no longer lowers the BIC. In this part, we adopt forward regression to illustrate the proposed method. A backward elimination procedure or a stepwise regression procedure could also be used; since the three methods perform similarly, we report only the results for forward regression.
In accordance with Sect. 5.1, we apply our method as well as uniform subsampling, leverage score subsampling and IBOSS to the six cases. The selection accuracies for the six cases are presented in Figure 5. The log MSPEs are provided in Figure 6 to evaluate the prediction performance.
From Figures 5 and 6, one can see that the forward regression results based on BIC are very similar to the all-subset regression results based on BIC.
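The forward regression procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported simulations: the null model is taken to be intercept-only (column 0 is kept in every model), least squares is solved through the normal equations, and the data are simulated with hypothetical coefficients. The function names `solve`, `rss`, `bic` and `forward_bic` are illustrative.

```python
import math
import random

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Z = [[row[j] for j in cols] for row in X]
    k = len(cols)
    G = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    g = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(G, g)
    fitted = [sum(bj * zj for bj, zj in zip(beta, z)) for z in Z]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def bic(X, y, cols):
    """BIC of the linear model that uses the given columns."""
    r = len(y)
    return r * math.log(rss(X, y, cols) / r) + (len(cols) + 1) * math.log(r)

def forward_bic(X, y):
    """Forward selection: add the best variable while it lowers the BIC."""
    selected = [0]  # assumption: column 0 is an intercept kept in every model
    remaining = set(range(1, len(X[0])))
    best = bic(X, y, selected)
    while remaining:
        cand, cand_bic = min(
            ((j, bic(X, y, selected + [j])) for j in remaining),
            key=lambda t: t[1],
        )
        if cand_bic >= best:  # no candidate improves the criterion: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_bic
    return sorted(selected)

# Simulated data (hypothetical): only columns 1 and 3 carry signal
random.seed(2)
n = 200
X = [[1.0] + [random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [1.0 + 2.0 * x[1] - 1.5 * x[3] + random.gauss(0, 0.5) for x in X]
selected = forward_bic(X, y)
```

In the subdata setting, `X` and `y` would be the selected rows rather than the full data. With signals this strong the two active columns should always be picked up, although BIC may occasionally admit a noise variable as well.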
Additional simulation results on Lasso
We now examine the performance of our method for model selection with the Lasso (Tibshirani 1996).
To align with the settings described in Sect. 5.1, we again apply our method and the other three methods (i.e., uniform subsampling, leverage score subsampling, and IBOSS) to the six cases listed at the beginning of Sect. 5.1 and evaluate the selection performance through the selection accuracies and MSPEs. The Lasso is fitted with the glmnet package (Simon et al. 2011), and the tuning parameter is selected by 10-fold cross-validation using the cv.glmnet() function. For leverage score subsampling, the Lasso is conducted as in Leng and Leung (2011).
Results on the selection accuracies are presented in Figure 7. The selection results based on the Lasso are very similar to the subset selection results based on BIC.
To see the benefits of model selection, we also report the log MSPEs in Figure 8 with \(n_\mathrm{test}=500.\) Figure 8 clearly shows that IBOSS and our method (LEVSS) uniformly outperform the uniform subsampling method.
Yu, J., Wang, H. Subdata selection algorithm for linear model discrimination. Stat Papers 63, 1883–1906 (2022). https://doi.org/10.1007/s00362-022-01299-8
DOI: https://doi.org/10.1007/s00362-022-01299-8
Keywords
- Bayesian information criterion
- Big data
- Discrimination design
- D-optimal design
- Entropy
- Measurement constraints