Subdata selection algorithm for linear model discrimination

Yu, Jun; Wang, HaiYing

doi:10.1007/s00362-022-01299-8

Regular Article
Published: 03 March 2022

Volume 63, pages 1883–1906, (2022)
Statistical Papers

Abstract

A statistical method is likely to be sub-optimal if the assumed model does not reflect the structure of the data at hand. For this reason, it is important to perform model selection before statistical analysis. However, selecting an appropriate model from a large candidate pool is usually computationally infeasible when faced with a massive data set, and little work has been done to study data selection for model selection. In this work, we propose a subdata selection method based on leverage scores which enables us to conduct the selection task on a small subdata set. Compared with existing subsampling methods, our method not only improves the probability of selecting the best model but also enhances the estimation efficiency. We justify this both theoretically and numerically. Several examples are presented to illustrate the proposed method.

References

Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Atkinson AC, Fedorov VV (1975) The design of experiments for discriminating between two rival models. Biometrika 62:57–70
Bingham DR, Chipman HA (2007) Incorporating prior information in optimal design for model selection. Technometrics 49:155–163
Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194
Box GEP, Hill WJ (1967) Discrimination among mechanistic models. Technometrics 9:57–71
Candes E, Tao T et al (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35:2313–2351
Chakrabortty A, Cai T (2018) Efficient and adaptive linear regression in semi-supervised settings. Ann Stat 46:1541–1572
Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, vol 80, pp 844–853
Chipman HA, Hamada MS (1996) Discussion: factor-based or effect-based modeling? implications for design. Technometrics 38:317–320
Claeskens G, Hjort NL (2008) Model selection and model averaging. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge
Consonni G, Deldossi L (2016) Objective Bayesian model discrimination in follow-up experimental designs. TEST 25:397–412
Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol. In press
Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39
Dette H, Titoff S (2009) Optimal discrimination designs. Ann Stat 37:2056–2082
Dette H, Melas VB, Guchenko R (2015) Bayesian T-optimal discriminating designs. Ann Stat 43:1959–1985
Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157
Drineas P, Mahoney MW, Muthukrishnan S, Sarlós T (2011) Faster least squares approximation. Numerische Mathematik 117:219–249
Drovandi CC, McGree JM, Pettitt AN (2014) A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. J Comput Gr Stat 23:3–24
Efron B, Hastie T, Johnstone I, Tibshirani R et al (2004) Least angle regression. Ann Stat 32:407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on Statistics and Applied Probability. Springer, Berlin
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc: Ser B 55:757–779
Hastie TJ, Tibshirani RJ (1990) Generalized additive models, vol 43. CRC Press, Boca Raton
Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
Kadane JB, Lazar NA (2004) Methods and criteria for model selection. J Am Stat Assoc 99:279–290
Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2015) A scalable bootstrap for massive data. J R Stat Soc: Ser B 76:795–816
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80
Leng C, Leung DHY (2011) Model selection in validation sampling: an asymptotic likelihood-based lasso approach. Stat Sin 21:659–678
Li T, Meng C (2021) Modern subsampling methods for large-scale least squares regression. arXiv preprint arXiv:2105.01552
Lindley DV (1956) On a measure of the information provided by an experiment. Ann Math Stat 27:986–1005
López-Fidalgo J, Tommasi C, Trandafir PC (2007) An optimal experimental design criterion for discriminating between non-normal models. J R Stat Soc: Ser B 69:231–242
Ma P, Mahoney MW, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919
Ma P, Zhang X, Xing X, Ma J, Mahoney MW (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. arXiv preprint arXiv:2002.10526
Mahoney MW (2012) Randomized algorithms for matrices and data. Found Trends Mach Learn 3:647–672
Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
Mamonov S, Triantoro T (2018) Subjectivity of diamond prices in online retail: insights from a data mining study. J Theor Appl Electron Commer Res 13:15–28
McCullagh P, Nelder JA (1989) Generalized linear models. Monographs on Statistics and Applied Probability, vol 37. Chapman & Hall
Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over-or underdetermined systems. SIAM J Sci Comput 36:C95–C118
Meng C, Wang Y, Zhang X, Mandal A, Ma P, Zhong W (2017) Effective statistical methods for big data analytics. In: Handbook of Research on Applied Cybernetics and Systems Science, pp 280–299
Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2020a) Lowcon: a design-based subsampling approach in a misspecified linear model. J Comput Gr Stat. In press
Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020b) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735
Meyer RD, Steinberg DM, Box G (1996) Follow-up designs to resolve confounding in multifactor experiments. Technometrics 38:303–313
Miller A (2002) Subset selection in regression. CRC Press, Boca Raton
Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep., National Bureau of Economic Research
Papailiopoulos D, Kyrillidis A, Boutsidis C (2014) Provable deterministic leverage score sampling. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 997–1006
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Sebastiani P, Wynn HP (2000) Maximum entropy sampling and optimal Bayesian experimental design. J R Stat Soc: Ser B 62:145–157
Shao J (1997) An asymptotic theory for linear model selection. Stat Sin 7:221–264
Shewry MC, Wynn HP (1987) Maximum entropy sampling. J Appl Stat 14:165–170
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39:1–13
Sin CY, White H (1996) Information criteria for selecting possibly misspecified parametric models. J Econom 71:207–225
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B 58:267–288
Truong Y, Kooperberg C, Stone C, Hansen M (2005) Statistical modeling with spline functions: methodology and theory. Springer Series in Statistics, Springer, New York
van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405
Xu C, Chen J, Mantel H (2013) Pseudo-likelihood-based Bayesian information criterion for variable selection in survey data. Surv Methodol 39:303–321
Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937–950
Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:585–599
Yao Y, Wang H (2021) A selective review on statistical techniques for big data. In: Modern statistical methods for health research. Springer. In press
Yuan Z, Yang Y (2005) Combining linear regression models: when and how? J Am Stat Assoc 100:1202–1214
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang T, Ning Y, Ruppert D (2020) Optimal sampling for generalized linear models under measurement constraints. J Comput Gr Stat. In press
Zheng C, Ferrari D, Yang Y (2019) Model selection confidence sets by likelihood ratio testing. Stat Sin 29:827–851

Acknowledgements

The authors sincerely thank the editor, associate editor, and referees for their valuable comments and insightful suggestions, which led to further improvement of this article. The authors are also grateful to Professors Mingyao Ai and Ping Ma for helpful discussions. This work is supported by NSFC (Grant No. 12001042), the Beijing Institute of Technology Research Fund Program for Young Scholars, and the National Science Foundation (Grant No. 2105571).

Author information

Authors and Affiliations

School of Mathematics and Statistics, and Key Laboratory of Mathematical Theory and Computation in Information Security, Beijing Institute of Technology, Beijing, 100811, China
Jun Yu
Department of Statistics, University of Connecticut, Storrs, CT, 06269, USA
HaiYing Wang

Corresponding author

Correspondence to HaiYing Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Technical details

Proof of Theorem 1

For any given subdata $X^*$, by applying the entropy decomposition in information theory (Sebastiani and Wynn 2000, Equation (2)), the joint entropy of $Y^*$ and the parameter $\varTheta $ can be decomposed as

$$\begin{aligned} \begin{aligned} \mathrm {Ent}(Y^*,\varTheta |X^*)&= \mathrm {Ent}(\varTheta |X^*)+ E_{\varTheta }\{\mathrm {Ent}(Y^*|\varTheta ,X^*)\} \\&= \mathrm {Ent}(\varTheta )+ \mathrm {Ent}(\varvec{\varepsilon }^*), \end{aligned} \end{aligned}$$

(A.1)

where $\varvec{\varepsilon }^*$ stands for the corresponding error term of $(Y^*,X^*)$ in model (1). The second equality holds by the model assumption, since conditional on $\varTheta ,X^*$ all the randomness of $Y^*$ comes from the error term, and $\varTheta $ is functionally independent of $X^*$. This implies that, for a fixed subdata size r, $\mathrm {Ent}(Y^*,\varTheta |X^*)$ is a constant that does not depend on which rows are selected.

Note that $\mathrm {Ent}(Y^*,\varTheta |X^*)$ can also be decomposed as

$$\begin{aligned} \mathrm {Ent}(Y^*,\varTheta |X^*)=\mathrm {Ent}(Y^*|X^*)+E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\}. \end{aligned}$$

(A.2)

That is, since the left-hand side of (A.2) is constant, maximizing $\mathrm {Ent}(Y^*|X^*)$ is equivalent to minimizing the overall expected deviance loss $E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\}$.

We now turn to calculating $\mathrm {Ent}(Y^*|X^*)$. Without loss of generality, assume that the first $p_{(t)}$ columns of X form the model matrix of the true model, so that the prior of $\beta _{(t)}$ is $N(\beta _\mathrm{prior,(t)},\sigma _f^2I_{p_{(t)}})$, where $\beta _\mathrm{prior,(t)}$ corresponds to the first $p_{(t)}$ entries of $\beta _\mathrm{prior}$. Since $Y^*=X_{(t)}^*\beta _{(t)}+\varvec{\varepsilon }^*$, the marginal distribution of $Y^*$ is normal with mean $X_{(t)}^*\beta _\mathrm{prior,(t)}$ and covariance matrix $\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }}$ under model (1). Because the entropy of a normal distribution with covariance matrix $\varSigma _Y$ is $\frac{1}{2}\log \det (2\pi e\varSigma _Y)$, the desired result follows from

$$\begin{aligned} \begin{aligned} \mathrm {Ent}(Y^*|X^*)&=\tfrac{1}{2}\log \det (\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }})+c_1\\&=\tfrac{1}{2}\log \det (\sigma ^{-2}\sigma _f^{2}X_{(t)}^{*{\mathrm {T} }}X_{(t)}^{*}+I_{p_{(t)}})+c_2\\&=\tfrac{1}{2}\log \det (X_{(t)}^{*{\mathrm {T} }}X_{(t)}^{*}+\sigma ^{2}\sigma _f^{-2}I_{p_{(t)}})+c_3, \end{aligned} \end{aligned}$$

(A.3)

where $c_1,c_2,c_3$ are constants that may depend on the subdata size r but not on which rows are selected. The second equality follows from the matrix determinant lemma, i.e., $\det (A+BC)=\det (A)\det (I+CA^{-1}B)$ for conformable matrices A, B, C with $A>0$. $\square $
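The determinant identities behind (A.3) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the dimensions, variances, and the random matrix standing in for $X_{(t)}^*$ are illustrative choices, not values from the article. It confirms that the three log-determinants differ only by terms that depend on r, $p_{(t)}$, $\sigma ^2$ and $\sigma _f^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p = 8, 3                      # subdata size and true-model dimension (illustrative)
sigma2, sigma_f2 = 1.5, 0.7      # error variance and prior variance (illustrative)
X = rng.standard_normal((r, p))  # stands in for the selected model matrix X*_(t)

def logdet(M):
    """Log-determinant of a positive definite matrix."""
    sign, value = np.linalg.slogdet(M)
    assert sign > 0
    return value

# r x r form: log det(sigma^2 I_r + sigma_f^2 X X^T)
lhs = logdet(sigma2 * np.eye(r) + sigma_f2 * X @ X.T)

# p x p form after the matrix determinant lemma, plus the constant r*log(sigma^2)
mid = logdet((sigma_f2 / sigma2) * X.T @ X + np.eye(p)) + r * np.log(sigma2)

# Rescaled p x p form, plus the constants r*log(sigma^2) + p*log(sigma_f^2 / sigma^2)
rhs = (logdet(X.T @ X + (sigma2 / sigma_f2) * np.eye(p))
       + r * np.log(sigma2) + p * np.log(sigma_f2 / sigma2))

print(lhs, mid, rhs)  # the three values agree up to rounding error
```

Because the added constants do not involve the rows of X, maximizing any one of the three log-determinants over the choice of subdata is the same optimization problem.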

Proof of Theorem 2

It is sufficient to show that $X^{*{\mathrm {T} }}X^*\le (\sum _{i=1}^n\delta _i h_{ii}) X^{{\mathrm {T} }}X$ in the sense of the Loewner ordering. Let ${{x}}_i^{\mathrm {T} }$ be the ith row of X, so that $h_{ii}={{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1}{{x}}_i$. For any $ a\in \mathbb {R}^{p}$, since $X^{\mathrm {T} }X$ is a full rank matrix, a can be represented as $ a=(X^{\mathrm {T} }X)^{-1/2} b$ for some $ b\in \mathbb {R}^{p}$. Then,

$$\begin{aligned} a^{\mathrm {T} }{{x}}_i {{x}}_i^{\mathrm {T} }a&= b^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2} {{x}}_i{{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2} b \\&\le \mathrm {tr}\{(X^{\mathrm {T} }X)^{-1/2} {{x}}_i{{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2}\}\Vert b\Vert _2^2 \\&= h_{ii}\{b^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2}(X^{\mathrm {T} }X)(X^{\mathrm {T} }X)^{-1/2} b\} \end{aligned}$$

(A.4)

$$\begin{aligned}&= h_{ii} a^{\mathrm {T} }(X^{\mathrm {T} }X) a, \end{aligned}$$

(A.5)

where $\mathrm {tr}(\cdot )$ is the trace operator and $(X^{\mathrm {T} }X)^{-1/2}(X^{\mathrm {T} }X)^{-1/2}=(X^{\mathrm {T} }X)^{-1}$. The inequality holds because the largest eigenvalue of a positive semidefinite matrix is at most its trace, and the trace in the second line equals ${{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1}{{x}}_i=h_{ii}$. Therefore, $ {{x}}_i {{x}}_i^{\mathrm {T} }\le h_{ii}X^{\mathrm {T} }X$, and the desired result follows by multiplying both sides of the inequality by $\delta _i$ and summing over i. $\square $
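The Loewner bound established in this proof is easy to verify on simulated data. The sketch below assumes NumPy; the sizes, the Gaussian design, and the rule of picking the r rows with the largest leverage scores are illustrative assumptions made only for this check and are not claimed to reproduce Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 200, 4, 20             # full size, dimension, subdata size (illustrative)
X = rng.standard_normal((n, p))

# Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i, computed from a thin QR factor:
# the squared row norms of Q equal the diagonal of the hat matrix.
Q, _ = np.linalg.qr(X)
h = np.sum(Q ** 2, axis=1)

# delta_i = 1 for selected rows. Assumption for this sketch: keep the r rows
# with the largest leverage scores.
delta = np.zeros(n)
delta[np.argsort(h)[-r:]] = 1.0

info_sub = X.T @ (delta[:, None] * X)   # X*^T X* = sum_i delta_i x_i x_i^T
bound = (delta @ h) * (X.T @ X)         # (sum_i delta_i h_ii) X^T X

# The bound holds in the Loewner order iff bound - info_sub is positive semidefinite.
min_eig = np.linalg.eigvalsh(bound - info_sub).min()
print(min_eig >= -1e-8)
```

The same check passes for any nonnegative choice of `delta`, since the proof bounds each rank-one term separately.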

For clarity, we begin with the proof of the following lemma since some results in the following lemma will be used in the proof of Theorem 3.

Lemma 1

Assume that $n^{-1}X^TX$ converges to a positive definite matrix. Let $\hat{\beta }_k^*$ be the MLE for the kth candidate model based on the subdata set selected by Algorithm 1. As $r\rightarrow \infty $ and $n\rightarrow \infty $, the following result holds:

$$\begin{aligned} \mathrm{Var}(\hat{\beta }_k^*)=O\left( \frac{1}{n\sum _{i=1}^rh_{(ii)}}\right) . \end{aligned}$$

(A.6)

Proof of Lemma 1

According to Algorithm 1, $X^*=U_\varGamma \varSigma V^{\mathrm {T} }$. Then it is sufficient to show that

$$\begin{aligned} c\sum _{i=1}^{r}h_{(ii)} \le \lambda _{\min } (U_\varGamma ^TU_\varGamma ) \le \lambda _{\max } (U_\varGamma ^TU_\varGamma )\le \sum _{i=1}^{r}h_{(ii)}, \end{aligned}$$

(A.7)

for some constant c, where $\lambda _{\max }(A)$ and $\lambda _{\min }(A)$ stand for the maximum and minimum eigenvalues of A, respectively. Since $U_\varGamma ^{\mathrm {T} }U_\varGamma $ is positive definite by the construction in Algorithm 1, $\lambda _{\max } (U_\varGamma ^{\mathrm {T} }U_\varGamma )\le \mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )=\sum _{i=1}^{r}h_{(ii)}$. By the definition of the condition number, it holds that

$$\begin{aligned} \lambda _{\min } (U_\varGamma ^{\mathrm {T} }U_\varGamma )=\lambda _{\max } (U_\varGamma ^{\mathrm {T} }U_\varGamma )/\kappa (U_\varGamma ^{\mathrm {T} }U_\varGamma )\ge \mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )/(pT), \end{aligned}$$

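where the inequality follows from the fact that $\mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )\le p\lambda _{\max }(U_\varGamma ^{\mathrm {T} }U_\varGamma )$ together with the bound $\kappa (U_\varGamma ^{\mathrm {T} }U_\varGamma )\le T$. Hence (A.7) holds with $c=1/(pT)$.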

From (A.7), it follows that

$$\begin{aligned} c\sum _{i=1}^{r}h_{(ii)}V\varSigma ^2 V^T \le X^{*T}X^* \le \sum _{i=1}^{r}h_{(ii)}V\varSigma ^2 V^T, \end{aligned}$$

(A.8)

and the desired result follows by noting that $X^TX=V\varSigma ^2 V^T$ and $\mathrm{Var}(\hat{\beta }_k^*)=(P^TX^{*T}X^*P)^{-1}$ for the projection matrix P such that $X_{(k)}=XP$, where $X_{(k)}$ is the design matrix for model $S_k$. $\square $
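The step from (A.7) to (A.8) can be spelled out as follows. This is a short worked derivation using only the factorization $X^*=U_\varGamma \varSigma V^{\mathrm {T} }$ stated above; the vector a is an arbitrary element of $\mathbb {R}^{p}$ introduced for the argument.

$$\begin{aligned} a^{\mathrm {T} }X^{*{\mathrm {T} }}X^*a&=(\varSigma V^{\mathrm {T} }a)^{\mathrm {T} }\,U_\varGamma ^{\mathrm {T} }U_\varGamma \,(\varSigma V^{\mathrm {T} }a)\\&\le \lambda _{\max }(U_\varGamma ^{\mathrm {T} }U_\varGamma )\Vert \varSigma V^{\mathrm {T} }a\Vert _2^2 \le \Big (\sum _{i=1}^{r}h_{(ii)}\Big )\,a^{\mathrm {T} }V\varSigma ^2V^{\mathrm {T} }a, \end{aligned}$$

and, in the same way,

$$\begin{aligned} a^{\mathrm {T} }X^{*{\mathrm {T} }}X^*a\ge \lambda _{\min }(U_\varGamma ^{\mathrm {T} }U_\varGamma )\Vert \varSigma V^{\mathrm {T} }a\Vert _2^2\ge c\Big (\sum _{i=1}^{r}h_{(ii)}\Big )\,a^{\mathrm {T} }V\varSigma ^2V^{\mathrm {T} }a. \end{aligned}$$

Since a is arbitrary, these two inequalities are exactly the Loewner bounds in (A.8).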

We now turn to the proof of Theorem 3.

Proof of Theorem 3

Let $\mathcal {M}^C$ denote the set of correct candidate models, and $\mathcal {M}^I=\mathcal {M}-\mathcal {M}^C$ the set of incorrect candidate models. We first show that

$$\begin{aligned} \varDelta (k)=\liminf _r\min _{S_k\in \mathcal {M}^I }\Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2/\log r\rightarrow \infty , \end{aligned}$$

(A.9)

where $\mu ^*$ stands for the mean of the selected data, $H_{(k)}^*=X_{(k)}^*(X_{(k)}^{*{\mathrm {T} }}X_{(k)}^*)^{-1}X_{(k)}^{*{\mathrm {T} }}.$

For any candidate model in $\mathcal {M}^{I}$, say $S_k$, let the model matrix for the closest correct model be $\tilde{X}_{(k)}^*:=(X_{(\check{c})}^*,X_{(k)}^*)$. Here $X^*_{(\check{c})}$ stands for the “complementary” design, which consists of the columns of $X_{(t)}^*$ that are not included in $X_{(k)}^*$. Denote the regression coefficient vector corresponding to $X_{(\check{c})}^*$ by $\beta _{(\check{c})}$, which is a subvector of $\beta _{(t)}$. Direct calculation yields

$$\begin{aligned} \Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2&=\inf _{\alpha }\Vert X^*_{(\check{c})}\beta _{(\check{c})}-X^*_{(k)}\alpha \Vert ^2 \end{aligned}$$

(A.10)

$$\begin{aligned}&=\inf _{\alpha }\{(\beta _{(\check{c})}^{\mathrm {T} },\alpha ^{{\mathrm {T} }})(\tilde{X}_{(k)}^{*{\mathrm {T} }}\tilde{X}^*_{(k)})(\beta _{(\check{c})}^{\mathrm {T} },\alpha ^{{\mathrm {T} }})^{{\mathrm {T} }}\}. \end{aligned}$$

(A.11)

Utilizing the results in (A.8), we have

$$\begin{aligned} \left( c_2n\sum _{i=1}^{r}h_{(ii)}\right) \left( \frac{1}{n}X^{\mathrm {T} }X\right) \le X^{*{\mathrm {T} }}X^* \le \left( n\sum _{i=1}^{r}h_{(ii)}\right) \left( \frac{1}{n}X^{\mathrm {T} }X\right) , \end{aligned}$$

(A.12)

for some constant $c_2$. Note that $\tilde{X}_{(k)}$ is a submatrix of X up to a column permutation. Thus $\lambda _{\min }(\tilde{X}_{(k)}^{*{\mathrm {T} }}\tilde{X}^*_{(k)})\ge \lambda _{\min }(X^{*{\mathrm {T} }}X^*)$, which by (A.12) and the positive definiteness of the limit of $n^{-1}X^{\mathrm {T} }X$ is bounded below by a constant multiple of $n\sum _{i=1}^{r}h_{(ii)}$, where $\lambda _{\min }(\cdot )$ stands for the smallest eigenvalue of a square matrix. From (A.10) and (A.11), we have

$$\begin{aligned} \liminf _{n,r\rightarrow \infty }\min _{S_k\in \mathcal {M}^I }\Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2/\log r\ge \liminf _{n,r\rightarrow \infty }\min _j(n\sum _{i=1}^{r}h_{(ii)})\Vert \beta _{(t)j}\Vert ^2/\log r\rightarrow \infty , \end{aligned}$$

which implies (A.9) holds.

For convenience, let ${\text {BIC}^*}(S_k)=r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2\right) +(p_{(k)}+1)\log r.$ From (3.7) in Shao (1997), for any model $S_k$ in $\mathcal {M}^I$, we have

$$\begin{aligned} \sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2-\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2\ge \varDelta (k)\ge p\log r>0, \end{aligned}$$

(A.13)

which implies $r\{\log (\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2)-\log (\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2)\}>r\log (1+p\log r/r)$ under the assumption $\hat{\sigma }^*\not \rightarrow 0.$ Therefore,

$$\begin{aligned} {\text {BIC}^*}(S_k)-{\text {BIC}^*}(S_t)\ge r\log (1+{p\log r}/{r})-(p-p_{(k)})\log r\rightarrow \infty . \end{aligned}$$

(A.14)

Similarly, for any model $S_{k^\prime }$ in $\mathcal {M}^C$ with $p_{(k^\prime )}>p_{(t)},$ where $p_{(t)}$ is the column dimension of $X_{(t)}$, it is straightforward to see that

$$\begin{aligned} r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2\right) -r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k^\prime )})^2\right) \rightarrow \chi ^2_{p_{(k^\prime )}-p_{(t)}} \end{aligned}$$

(A.15)

in distribution, according to the likelihood ratio test (see van der Vaart 1998, Chapter 16). Here the residual sum of squares of the true model $S_t$ comes first because the larger nested model $S_{k^\prime }$ always fits at least as well. Since the penalty terms of the two models differ by $(p_{(k^\prime )}-p_{(t)})\log r$, it holds that

$$\begin{aligned} {\text {BIC}^*}(S_{k^\prime })-{\text {BIC}^*}(S_t)=O(\log r)\rightarrow \infty . \end{aligned}$$

(A.16)

Combining (A.14) and (A.16), we can get the desired result. $\square $

Proof of Theorem 4

This follows directly from Lemma 1.

Additional simulation results on forward regression

Since the number of possible models increases exponentially with p, all-subset regression is only feasible when p is relatively small. Alternatively, a forward selection approach is usually adopted. More precisely, forward regression starts from the null model and, at each step, adds to the currently “best” model the one variable that yields the lowest value of the BIC. This process is repeated until adding any further variable no longer lowers the BIC. In this part, we adopt forward regression to illustrate the proposed method. Of course, a backward elimination procedure or a step-wise regression procedure can also be adopted. Since the three methods have similar performance, we only report the results on forward regression.
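The forward procedure described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, centered data with no intercept, and the BIC form $r\log (\mathrm{RSS}/r)+(p_{(k)}+1)\log r$ used in the proof of Theorem 3; the function names `bic` and `forward_bic` and the simulated data are hypothetical and not taken from the article.

```python
import numpy as np

def bic(y, X, cols):
    """BIC of the Gaussian linear model using the columns in `cols` (no intercept)."""
    r = len(y)
    if len(cols) == 0:
        resid = y                                   # null model: fitted values are zero
    else:
        X_sub = X[:, np.asarray(cols, dtype=int)]
        coef, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
        resid = y - X_sub @ coef
    rss = float(resid @ resid)
    return r * np.log(rss / r) + (len(cols) + 1) * np.log(r)

def forward_bic(y, X):
    """Greedy forward selection: add the variable that lowers BIC most, stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best = bic(y, X, selected)
    while remaining:
        score, j = min((bic(y, X, selected + [j]), j) for j in remaining)
        if score >= best:                           # no candidate improves the BIC
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return sorted(selected)

# Illustrative data: only columns 0 and 3 are active.
rng = np.random.default_rng(2)
r, p = 500, 6
X = rng.standard_normal((r, p))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.standard_normal(r)
chosen = forward_bic(y, X)
print(chosen)
```

In the setting of the article, `X` and `y` would be the selected subdata rather than the simulated sample used here; each step costs one least-squares fit per remaining variable, so at most p(p+1)/2 fits are needed instead of $2^p$.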

In accordance with Sect. 5.1, we also demonstrate our method as well as uniform subsampling, leverage score subsampling, and IBOSS through the six cases. The selection accuracies for the six cases are presented in Figure 5. The log MSPEs are also provided in Figure 6 to evaluate prediction performance.

From Figures 5 and 6, one can see that the forward regression results based on BIC are very similar to the results of the all-subset regressions based on BIC.

Additional simulation results on Lasso

We now examine our method’s performance on model selection with the Lasso (Tibshirani 1996).

To align with the settings described in Sect. 5.1, we also demonstrate our method as well as the other three methods (i.e., uniform subsampling, leverage score subsampling, and IBOSS) on the six cases listed at the beginning of Sect. 5.1 and evaluate the selection performance through the selection accuracies and MSPEs. The Lasso is fitted with the glmnet package (Simon et al. 2011), and the tuning parameters are selected through 10-fold cross-validation using the cv.glmnet() function. As for the leverage score subsampling, the Lasso is conducted as in Leng and Leung (2011).
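The tuning step described above, a Lasso path with the penalty chosen by 10-fold cross-validation, can be illustrated without R. The sketch below is a plain NumPy stand-in and not glmnet: it assumes centered data with no intercept, uses a simple cyclic coordinate descent, picks the penalty with the smallest cross-validated squared error, and all names (`lasso_cd`, `cv_lasso`), the penalty grid, and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/(2n))||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()                  # residual y - Xb for b = 0
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)      # keep the residual in sync
            b[j] = new_bj
    return b

def cv_lasso(X, y, lams, k=10, seed=0):
    """Pick the penalty with the smallest k-fold CV squared error, then refit on all data."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    cv_err = np.zeros(len(lams))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        for i, lam in enumerate(lams):
            b = lasso_cd(X[train_idx], y[train_idx], lam)
            cv_err[i] += np.sum((y[test_idx] - X[test_idx] @ b) ** 2)
    best_lam = lams[int(np.argmin(cv_err))]
    return best_lam, lasso_cd(X, y, best_lam)

# Illustrative data: only columns 0 and 3 are active.
rng = np.random.default_rng(3)
r, p = 300, 6
X = rng.standard_normal((r, p))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.standard_normal(r)
lams = np.geomspace(1.0, 0.01, 10)
best_lam, coef = cv_lasso(X, y, lams)
support = np.flatnonzero(np.abs(coef) > 1e-8)       # indices of the selected variables
print(best_lam, support)
```

The selected model is the set of nonzero coefficients at the chosen penalty. Which penalty rule the article used within cv.glmnet() is not stated in this section, so the minimum-error rule here is an assumption.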

Results on the selection accuracies are presented in Figure 7. It can be seen that the selection results based on the Lasso are very similar to those of the subset selection based on BIC.

To see the benefits of the model selection, we also report the log MSPEs in Figure 8 with $n_\mathrm{test}=500.$ From Figure 8, we can clearly see that IBOSS and our method (LEVSS) uniformly perform better than the uniform subsampling method.

About this article

Cite this article

Yu, J., Wang, H. Subdata selection algorithm for linear model discrimination. Stat Papers 63, 1883–1906 (2022). https://doi.org/10.1007/s00362-022-01299-8

Received: 12 May 2021
Revised: 11 February 2022
Accepted: 12 February 2022
Published: 03 March 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s00362-022-01299-8
