
Subdata selection algorithm for linear model discrimination

  • Regular Article
  • Published in Statistical Papers (2022)

Abstract

A statistical method is likely to be sub-optimal if the assumed model does not reflect the structure of the data at hand. For this reason, it is important to perform model selection before statistical analysis. However, selecting an appropriate model from a large candidate pool is usually computationally infeasible when faced with a massive data set, and little work has been done to study data selection for model selection. In this work, we propose a subdata selection method based on leverage scores, which enables us to conduct the selection task on a small subdata set. Compared with existing subsampling methods, our method not only improves the probability of selecting the best model but also enhances the estimation efficiency. We justify this both theoretically and numerically. Several examples are presented to illustrate the proposed method.


References

  • Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
  • Atkinson AC, Fedorov VV (1975) The design of experiments for discriminating between two rival models. Biometrika 62:57–70
  • Bingham DR, Chipman HA (2007) Incorporating prior information in optimal design for model selection. Technometrics 49:155–163
  • Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194
  • Box GEP, Hill WJ (1967) Discrimination among mechanistic models. Technometrics 9:57–71
  • Candes E, Tao T et al (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35:2313–2351
  • Chakrabortty A, Cai T (2018) Efficient and adaptive linear regression in semi-supervised settings. Ann Stat 46:1541–1572
  • Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, vol 80, pp 844–853
  • Chipman HA, Hamada MS (1996) Discussion: factor-based or effect-based modeling? Implications for design. Technometrics 38:317–320
  • Claeskens G, Hjort NL (2008) Model selection and model averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge
  • Consonni G, Deldossi L (2016) Objective Bayesian model discrimination in follow-up experimental designs. TEST 25:397–412
  • Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol. In press
  • Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39
  • Dette H, Titoff S (2009) Optimal discrimination designs. Ann Stat 37:2056–2082
  • Dette H, Melas VB, Guchenko R (2015) Bayesian T-optimal discriminating designs. Ann Stat 43:1959–1985
  • Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157
  • Drineas P, Mahoney MW, Muthukrishnan S, Sarlós T (2011) Faster least squares approximation. Numer Math 117:219–249
  • Drovandi CC, McGree JM, Pettitt AN (2014) A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. J Comput Graph Stat 23:3–24
  • Efron B, Hastie T, Johnstone I, Tibshirani R et al (2004) Least angle regression. Ann Stat 32:407–499
  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
  • Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on Statistics and Applied Probability. Springer, Berlin
  • Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
  • Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc Ser B 55:757–779
  • Hastie TJ, Tibshirani RJ (1990) Generalized additive models, vol 43. CRC Press, Boca Raton
  • Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
  • Kadane JB, Lazar NA (2004) Methods and criteria for model selection. J Am Stat Assoc 99:279–290
  • Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2015) A scalable bootstrap for massive data. J R Stat Soc Ser B 76:795–816
  • Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
  • Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80
  • Leng C, Leung DHY (2011) Model selection in validation sampling: an asymptotic likelihood-based lasso approach. Stat Sin 21:659–678
  • Li T, Meng C (2021) Modern subsampling methods for large-scale least squares regression. arXiv preprint arXiv:2105.01552
  • Lindley DV (1956) On a measure of the information provided by an experiment. Ann Math Stat 27:986–1005
  • López-Fidalgo J, Tommasi C, Trandafir PC (2007) An optimal experimental design criterion for discriminating between non-normal models. J R Stat Soc Ser B 69:231–242
  • Ma P, Mahoney MW, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919
  • Ma P, Zhang X, Xing X, Ma J, Mahoney MW (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. arXiv preprint arXiv:2002.10526
  • Mahoney MW (2012) Randomized algorithms for matrices and data. Found Trends Mach Learn 3:647–672
  • Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
  • Mamonov S, Triantoro T (2018) Subjectivity of diamond prices in online retail: insights from a data mining study. J Theor Appl Electron Commer Res 13:15–28
  • McCullagh P, Nelder JA (1989) Generalized linear models. Monographs on Statistics and Applied Probability, vol 37. Chapman & Hall
  • Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over- or underdetermined systems. SIAM J Sci Comput 36:C95–C118
  • Meng C, Wang Y, Zhang X, Mandal A, Ma P, Zhong W (2017) Effective statistical methods for big data analytics. In: Handbook of research on applied cybernetics and systems science, pp 280–299
  • Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2020a) LowCon: a design-based subsampling approach in a misspecified linear model. J Comput Graph Stat. In press
  • Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020b) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735
  • Meyer RD, Steinberg DM, Box G (1996) Follow-up designs to resolve confounding in multifactor experiments. Technometrics 38:303–313
  • Miller A (2002) Subset selection in regression. CRC Press, Boca Raton
  • Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep., National Bureau of Economic Research
  • Papailiopoulos D, Kyrillidis A, Boutsidis C (2014) Provable deterministic leverage score sampling. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 997–1006
  • Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics
  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
  • Sebastiani P, Wynn HP (2000) Maximum entropy sampling and optimal Bayesian experimental design. J R Stat Soc Ser B 62:145–157
  • Shao J (1997) An asymptotic theory for linear model selection. Stat Sin 7:221–264
  • Shewry MC, Wynn HP (1987) Maximum entropy sampling. J Appl Stat 14:165–170
  • Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw 39:1–13
  • Sin CY, White H (1996) Information criteria for selecting possibly misspecified parametric models. J Econom 71:207–225
  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  • Truong Y, Kooperberg C, Stone C, Hansen M (2005) Statistical modeling with spline functions: methodology and theory. Springer Series in Statistics. Springer, New York
  • van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
  • Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59
  • Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
  • Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405
  • Xu C, Chen J, Mantel H (2013) Pseudo-likelihood-based Bayesian information criterion for variable selection in survey data. Surv Methodol 39:303–321
  • Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937–950
  • Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:585–599
  • Yao Y, Wang H (2021) A selective review on statistical techniques for big data. In: Modern statistical methods for health research. Springer. In press
  • Yuan Z, Yang Y (2005) Combining linear regression models: when and how? J Am Stat Assoc 100:1202–1214
  • Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
  • Zhang T, Ning Y, Ruppert D (2020) Optimal sampling for generalized linear models under measurement constraints. J Comput Graph Stat. In press
  • Zheng C, Ferrari D, Yang Y (2019) Model selection confidence sets by likelihood ratio testing. Stat Sin 29:827–851


Acknowledgements

The authors sincerely thank the editor, associate editor, and referees for their valuable comments and insightful suggestions, which led to further improvement of this article. The authors are also grateful to Professors Mingyao Ai and Ping Ma for helpful discussions. This work is supported by NSFC (Grant No. 12001042), the Beijing Institute of Technology Research Fund Program for Young Scholars, and the National Science Foundation (Grant No. 2105571).

Author information


Corresponding author

Correspondence to HaiYing Wang.


Appendix

Technical details

Proof of Theorem 1

For any given subdata \(X^*\), by applying the entropy decomposition in information theory (Sebastiani and Wynn 2000, Equation (2)), the joint entropy of \(Y^*\) and the parameter \(\varTheta \) can be decomposed as

$$\begin{aligned} \begin{aligned} \mathrm {Ent}(Y^*,\varTheta |X^*)&= \mathrm {Ent}(\varTheta |X^*)+ E_{\varTheta }\{\mathrm {Ent}(Y^*|\varTheta ,X^*)\} \\&= \mathrm {Ent}(\varTheta )+ \mathrm {Ent}(\varvec{\varepsilon }^*), \end{aligned} \end{aligned}$$
(A.1)

where \(\varvec{\varepsilon }^*\) stands for the corresponding error term of \((Y^*,X^*)\) in model (1). The second equality holds under the model assumption: conditional on \(\varTheta \) and \(X^*\), all the randomness of \(Y^*\) comes from the error term, and \(\varTheta \) is functionally independent of \(X^*\). This implies that \(\mathrm {Ent}(Y^*,\varTheta |X^*)\) is a constant that depends only on the subdata size r.

Note that \(\mathrm {Ent}(Y^*,\varTheta |X^*)\) can also be decomposed as

$$\begin{aligned} \mathrm {Ent}(Y^*,\varTheta |X^*)=\mathrm {Ent}(Y^*|X^*)+E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\}. \end{aligned}$$
(A.2)

That is to say, maximizing \(\mathrm {Ent}(Y^*|X^*)\) amounts to minimizing the overall expected deviance loss \(E_{Y^*}\{\mathrm {Ent}(\varTheta |Y^*,X^*)\}\).

Now we turn to calculating \(\mathrm {Ent}(Y^*|X^*)\). Without loss of generality, assume that the first \(p_{(t)}\) columns of X form the model matrix of the true model, so that the prior of \(\beta _{(t)}\) is \(N(\beta _\mathrm{prior,(t)},\sigma _f^2I_{p_{(t)}})\), where \(\beta _\mathrm{prior,(t)}\) consists of the first \(p_{(t)}\) entries of \(\beta _\mathrm{prior}\). Since \(Y^*=X_{(t)}^*\beta _{(t)}+\varvec{\varepsilon }^*\), the marginal distribution of \(Y^*\) under model (1) is normal with mean \(X_{(t)}^*\beta _\mathrm{prior,(t)}\) and variance \(\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }}\). The desired result follows from the facts

$$\begin{aligned} \mathrm {Ent}(Y^*|X^*)&=\log \det (\sigma ^2I_r+\sigma _f^{2}X_{(t)}^{*}X_{(t)}^{*{\mathrm {T} }})+c_1\\&=\log \det (\sigma ^{-2}\sigma _f^{2}X_{(t)}^{*{\mathrm {T} }}X_{(t)}^{*}+I_{p_{(t)}})+c_2\\&=\log \det (X_{(t)}^{*{\mathrm {T} }}X_{(t)}^{*}+\sigma ^{2}\sigma _f^{-2}I_{p_{(t)}})+c_3, \end{aligned}$$
(A.3)

where \(c_1,c_2,c_3\) are constants that depend only on the subdata size r. The second equality follows from the matrix determinant lemma, i.e., \(\det (A+BC)=\det (A)\det (I+CA^{-1}B)\) for conformable matrices A, B, C with \(A>0\). \(\square \)
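To make the determinant identity above concrete, the following R snippet is a minimal numerical check (with made-up dimensions and variance values, not code from the paper) that the \(r\times r\) determinant of the marginal covariance of \(Y^*\) reduces, via the matrix determinant lemma, to the \(p_{(t)}\times p_{(t)}\) determinant in the last line of (A.3), up to constants that depend only on r.

```r
## Numerical check of the matrix determinant lemma behind (A.3); sizes and
## variance values below are illustrative assumptions.
set.seed(1)
r <- 50; p <- 3
Xt_star <- matrix(rnorm(r * p), r, p)   # stand-in for the subdata model matrix X_(t)^*
sigma2  <- 1; sigma2_f <- 4             # error variance and prior variance

## r x r form: log det(sigma^2 I_r + sigma_f^2 X* X*^T)
lhs <- determinant(sigma2 * diag(r) + sigma2_f * tcrossprod(Xt_star),
                   logarithm = TRUE)$modulus

## p x p form: log det(X*^T X* + (sigma^2/sigma_f^2) I_p) plus constants in r only
rhs <- determinant(crossprod(Xt_star) + (sigma2 / sigma2_f) * diag(p),
                   logarithm = TRUE)$modulus +
       r * log(sigma2) + p * log(sigma2_f / sigma2)

all.equal(as.numeric(lhs), as.numeric(rhs))   # TRUE up to numerical error
```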

Proof of Theorem 2

It is sufficient to show that \(X^{*{\mathrm {T} }}X^*\le (\sum _{i=1}^n\delta _i h_{ii}) X^{{\mathrm {T} }}X\) in the sense of the Loewner ordering. Let \({{x}}_i\) denote the ith row of X, written as a column vector. For any \( a\in \mathbb {R}^{p}\), since \(X^{\mathrm {T} }X\) has full rank, a can be represented as \( a=(X^{\mathrm {T} }X)^{-1/2} b\) for some \( b\in \mathbb {R}^{p}\). Then,

$$\begin{aligned} a^{\mathrm {T} }{{x}}_i {{x}}_i^{\mathrm {T} }a&= b^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2} {{x}}_i{{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2} b \nonumber \\&\le \mathrm {tr}\{(X^{\mathrm {T} }X)^{-1/2} {{x}}_i{{x}}_i^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2}\}\Vert b\Vert _2^2 \nonumber \\&= h_{ii}\{b^{\mathrm {T} }(X^{\mathrm {T} }X)^{-1/2}(X^{\mathrm {T} }X)(X^{\mathrm {T} }X)^{-1/2} b\} \end{aligned}$$
(A.4)
$$\begin{aligned}&= h_{ii}\, a^{\mathrm {T} }(X^{\mathrm {T} }X) a, \end{aligned}$$
(A.5)

where \(\mathrm {tr}(\cdot )\) is the trace operator and \((X^{\mathrm {T} }X)^{-1/2}(X^{\mathrm {T} }X)^{-1/2}=(X^{\mathrm {T} }X)^{-1}\). Therefore, \( {{x}}_i {{x}}_i^{\mathrm {T} }\le h_{ii}X^{\mathrm {T} }X\), and the desired result follows by summing both sides of the inequality over the selected rows. \(\square \)
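The Loewner-order inequality \( {{x}}_i {{x}}_i^{\mathrm {T} }\le h_{ii}X^{\mathrm {T} }X\) is also easy to verify numerically. The R sketch below (simulated data; the sizes are arbitrary assumptions) computes the leverage scores \(h_{ii}\) and checks that \(h_{ii}X^{\mathrm {T} }X-{{x}}_i {{x}}_i^{\mathrm {T} }\) is positive semidefinite for every row.

```r
## Check x_i x_i^T <= h_ii X^T X (Loewner order) on simulated data
set.seed(2)
n <- 200; p <- 4
X   <- matrix(rnorm(n * p), n, p)
XtX <- crossprod(X)                        # X^T X
h   <- diag(X %*% solve(XtX, t(X)))        # leverage scores h_ii

psd_ok <- sapply(seq_len(n), function(i) {
  M <- h[i] * XtX - tcrossprod(X[i, ])     # h_ii X^T X - x_i x_i^T
  min(eigen(M, symmetric = TRUE, only.values = TRUE)$values) >= -1e-10
})
all(psd_ok)   # TRUE: every difference is positive semidefinite
```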

For clarity, we begin with the proof of the following lemma, since some of its results will be used in the proof of Theorem 3.

Lemma 1

Assume that \(n^{-1}X^TX\) converges to a positive definite matrix. Let \(\hat{\beta }_k^*\) be the MLE for the kth candidate model based on the subdata selected by Algorithm 1. As \(r\rightarrow \infty ,n\rightarrow \infty \), the following result holds:

$$\begin{aligned} \mathrm{Var}(\hat{\beta }_k^*)=O\left( \frac{1}{n\sum _{i=1}^rh_{(ii)}}\right) . \end{aligned}$$
(A.6)

Proof of Lemma 1

According to Algorithm 1, \(X^*=U_\varGamma \varSigma V^{\mathrm {T} }\). Then it is sufficient to show that

$$\begin{aligned} c\sum _{i=1}^{r}h_{(ii)} \le \lambda _{\min } (U_\varGamma ^TU_\varGamma ) \le \lambda _{\max } (U_\varGamma ^TU_\varGamma )\le \sum _{i=1}^{r}h_{(ii)}, \end{aligned}$$
(A.7)

for some constant c, where \(\lambda _{\max }(A)\) and \(\lambda _{\min }(A)\) stand for the maximum and minimum eigenvalues of A, respectively. Since \(U_\varGamma ^{\mathrm {T} }U_\varGamma \) is positive definite by construction in Algorithm 1, we have \(\lambda _{\max } (U_\varGamma ^{\mathrm {T} }U_\varGamma )\le \mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )=\sum _{i=1}^{r}h_{(ii)}\). By the definition of the condition number, it holds that

$$\begin{aligned} \lambda _{\min } (U_\varGamma ^{\mathrm {T} }U_\varGamma )=\lambda _{\max } (U_\varGamma ^{\mathrm {T} }U_\varGamma )/\kappa (U_\varGamma ^{\mathrm {T} }U_\varGamma )\ge \mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )/(pT), \end{aligned}$$

where the last inequality uses the fact that \(\mathrm {tr}(U_\varGamma ^{\mathrm {T} }U_\varGamma )\le p\lambda _{\max }(U_\varGamma ^{\mathrm {T} }U_\varGamma )\) together with the bound \(\kappa (U_\varGamma ^{\mathrm {T} }U_\varGamma )\le T\) on the condition number.

From (A.7), it follows that

$$\begin{aligned} c\sum _{i=1}^{r}h_{(ii)}V\varSigma ^2 V^T \le X^{*T}X^* \le \sum _{i=1}^{r}h_{(ii)}V\varSigma ^2 V^T, \end{aligned}$$
(A.8)

and the desired result follows by noting that \(X^TX=V\varSigma ^2 V^T\) and \(\mathrm{Var}(\hat{\beta }_k^*)=\sigma ^2(P^TX^{*T}X^*P)^{-1}\), where P is the column-selection matrix such that \(X_{(k)}=XP\) and \(X_{(k)}\) is the design matrix for model \(S_k\). \(\square \)
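For intuition, the following R sketch illustrates leverage-score-based subdata selection in the spirit of Algorithm 1. It is only a schematic reading of the algorithm (keep the r rows with the largest leverage scores computed from the thin SVD, then refit the linear model by least squares on the subdata), not the paper's implementation.

```r
## Schematic leverage-score subdata selection (an assumed simplification of
## Algorithm 1: keep the r rows of X with the largest leverage scores).
set.seed(3)
n <- 1e5; p <- 5; r <- 500
X    <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0.5, 0, 0)
y    <- drop(X %*% beta + rnorm(n))

sv  <- svd(X)                          # thin SVD: X = U Sigma V^T
h   <- rowSums(sv$u^2)                 # leverage scores h_ii = ||u_i||_2^2
idx <- order(h, decreasing = TRUE)[1:r]

fit_sub <- lm(y[idx] ~ X[idx, ] - 1)   # least squares on the selected subdata
coef(fit_sub)
```

In line with Lemma 1, the variance of such a subdata estimator is of order \(1/(n\sum _{i=1}^rh_{(ii)})\), which is why rows with large leverage scores are the natural ones to keep.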

Now let us turn to the proof of Theorem 3.

Proof of Theorem 3

Let \(\mathcal {M}^C\) denote the set of correct candidate models and \(\mathcal {M}^I=\mathcal {M}-\mathcal {M}^C\) the set of incorrect candidate models. We first show that

$$\begin{aligned} \varDelta (k)=\liminf _r\min _{S_k\in \mathcal {M}^I }\Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2/\log r\rightarrow \infty , \end{aligned}$$
(A.9)

where \(\mu ^*\) stands for the mean of the selected data, \(H_{(k)}^*=X_{(k)}^*(X_{(k)}^{*{\mathrm {T} }}X_{(k)}^*)^{-1}X_{(k)}^{*{\mathrm {T} }}.\)

For any candidate model \(S_k\in \mathcal {M}^{I}\), let the model matrix of the closest correct model be \(\tilde{X}_{(k)}^*:=(X_{(\check{c})}^*,X_{(k)}^*)\). Here \(X^*_{(\check{c})}\) stands for the “complementary” design, which consists of the columns of \(X_{(t)}^*\) that are not included in \(X_{(k)}^*\). Denote by \(\beta _{(\check{c})}\) the regression coefficient vector corresponding to \(X_{(\check{c})}^*\), which is a subvector of \(\beta _{(t)}\). Direct calculation yields

$$\begin{aligned} \Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2&= \inf _{\alpha }\Vert X^*_{(\check{c})}\beta _{(\check{c})}-X^*_{(k)}\alpha \Vert ^2 \end{aligned}$$
(A.10)
$$\begin{aligned}&= \inf _{\alpha }\{(\beta _{(\check{c})}^{\mathrm {T} },\alpha ^{{\mathrm {T} }})(\tilde{X}_{(k)}^{*{\mathrm {T} }}\tilde{X}^*_{(k)})(\beta _{(\check{c})}^{\mathrm {T} },\alpha ^{{\mathrm {T} }})^{{\mathrm {T} }}\}. \end{aligned}$$
(A.11)

Utilizing the results in (A.8), we have

$$\begin{aligned} \left( c_2n\sum _{i=1}^{r}h_{(ii)}\right) \left( \frac{1}{n}X^{\mathrm {T} }X\right) \le X^{*{\mathrm {T} }}X^* \le \left( n\sum _{i=1}^{r}h_{(ii)}\right) \left( \frac{1}{n}X^{\mathrm {T} }X\right) , \end{aligned}$$
(A.12)

for some constant \(c_2\). Note that \(\tilde{X}_{(k)}^*\) is a submatrix of \(X^*\) up to a column permutation. Thus \(\lambda _{\min }(\tilde{X}_{(k)}^{*{\mathrm {T} }}\tilde{X}^*_{(k)})\ge \lambda _{\min }(X^{*{\mathrm {T} }}X^*)\), which is of order \(n\sum _{i=1}^{r}h_{(ii)}\) by (A.12), where \(\lambda _{\min }(\cdot )\) stands for the smallest eigenvalue of a square matrix. Combining this with (A.11), we have

$$\begin{aligned} \liminf _{n,r\rightarrow \infty }\min _{S_k\in \mathcal {M}^I }\Vert \mu ^*-H_{(k)}^*\mu ^*\Vert ^2/\log r\ge \liminf _{n,r\rightarrow \infty }\min _j(n\sum _{i=1}^{r}h_{(ii)})\Vert \beta _{(t)j}\Vert ^2/\log r\rightarrow \infty , \end{aligned}$$

which implies (A.9) holds.

For convenience, let \({\text {BIC}^*}(S_k)=r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2\right) +(p_{(k)}+1)\log r.\) From (3.7) in Shao (1997), for any model \(S_k\) in \(\mathcal {M}^I\), we have

$$\begin{aligned} \sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2-\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2\ge \varDelta (k)\ge p\log r>0, \end{aligned}$$
(A.13)

which implies \(\log \{\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k)})^2\}-\log \{\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2\}>\log (1+p\log r/r)\) under the assumption \(\hat{\sigma }^*\not \rightarrow 0.\) Therefore,

$$\begin{aligned} {\text {BIC}^*}(S_k)-{\text {BIC}^*}(S_t)\ge r\log (1+{p\log r}/{r})-(p-p_{(k)})\log r\rightarrow \infty . \end{aligned}$$
(A.14)

Similarly, for any model \(S_{k^\prime }\) in \(\mathcal {M}^C\) with \(p_{(k^\prime )}>p_{(t)},\) where \(p_{(t)}\) is the column dimension of \(X_{(t)}\), it is straightforward to see that

$$\begin{aligned} r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(k^\prime )})^2\right) -r\log \left( r^{-1}\sum _{i=1}^r(y_i^*-{\hat{\mu }}_i^{*(t)})^2\right) \rightarrow \chi ^2_{p_{(k^\prime )}-p_{(t)}}, \end{aligned}$$
(A.15)

according to the log likelihood ratio test (see van der Vaart 1998, Chapter 16). Therefore, it holds that

$$\begin{aligned} {\text {BIC}^*}(S_{k^\prime })-{\text {BIC}^*}(S_t)=O(\log r)\rightarrow \infty . \end{aligned}$$
(A.16)

Combining (A.14) and (A.16), we obtain the desired result. \(\square \)
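For completeness, the subdata criterion \({\text {BIC}^*}\) used in this proof can be written as a small helper function. The sketch below is illustrative only: y_sub, X_sub and the column index set `model` are hypothetical names, and no intercept is fitted so that the penalty matches \((p_{(k)}+1)\log r\).

```r
## BIC*(S_k) = r * log(RSS_k / r) + (p_(k) + 1) * log(r) on the subdata
## (schematic helper; y_sub / X_sub are the selected subdata, `model` indexes
##  the columns of the candidate model, and no intercept is assumed)
bic_star <- function(y_sub, X_sub, model) {
  r   <- length(y_sub)
  fit <- lm.fit(x = X_sub[, model, drop = FALSE], y = y_sub)
  rss <- sum(fit$residuals^2)
  r * log(rss / r) + (length(model) + 1) * log(r)
}

## Example use: keep whichever candidate model attains the smaller BIC*
## which.min(c(m1 = bic_star(y_sub, X_sub, 1:2), m2 = bic_star(y_sub, X_sub, 1:3)))
```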

Proof of Theorem 4

This follows directly from Lemma 1. \(\square \)

Fig. 5 Selection accuracies for different r based on the UNIF (solid line with circles), LEVSS (dashed line with triangles), LEV (dotted line with squares) and IBOSS (dotdash line with plus signs) methods for the six cases listed in Sect. 5.1. The models are selected by forward regression via BIC

Fig. 6 Log MSPEs for different r based on the UNIF (solid line with circles), LEVSS (dashed line with triangles), LEV (dotted line with squares) and IBOSS (dotdash line with plus signs) methods for the six cases listed in Sect. 5.1. The models are selected by forward regression via BIC

Additional simulation results on forward regression

Since the number of possible models increases exponentially with p, all-subset regression is feasible only when p is relatively small. Alternatively, a forward selection approach is usually adopted. More precisely, forward regression starts from the null model and, at each step, adds to the current “best” model the single variable that yields the lowest BIC. This process is repeated until no additional variable improves the BIC. In this part, we use forward regression to illustrate the proposed method. Of course, a backward elimination procedure or a stepwise regression procedure could also be adopted; since the three procedures perform similarly, we only report the results for forward regression.
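The search just described can be coded in a few lines. The following R sketch mirrors the verbal description above rather than the paper's own implementation; y_sub and X_sub are simulated stand-ins for the selected subdata.

```r
## Forward selection via BIC on the subdata (schematic)
forward_bic <- function(y_sub, X_sub) {
  remaining <- colnames(X_sub)
  selected  <- character(0)
  best_fit  <- lm(y_sub ~ 1)                     # start from the null model
  repeat {
    cand_bic <- sapply(remaining, function(v) {
      df <- data.frame(X_sub[, c(selected, v), drop = FALSE])
      BIC(lm(y_sub ~ ., data = df))              # BIC after adding variable v
    })
    if (length(cand_bic) == 0 || min(cand_bic) >= BIC(best_fit)) break
    v         <- names(which.min(cand_bic))      # best single addition
    selected  <- c(selected, v)
    remaining <- setdiff(remaining, v)
    best_fit  <- lm(y_sub ~ ., data = data.frame(X_sub[, selected, drop = FALSE]))
  }
  selected
}

## Toy usage on simulated subdata
set.seed(5)
X_sub <- matrix(rnorm(300 * 8), 300, 8, dimnames = list(NULL, paste0("x", 1:8)))
y_sub <- drop(X_sub[, 1:2] %*% c(1.5, -2) + rnorm(300))
forward_bic(y_sub, X_sub)   # returns the informative variables x1 and x2 (order may vary)
```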

Fig. 7 Selection accuracies for different r based on the UNIF (solid line with circles), LEVSS (dashed line with triangles), LEV (dotted line with squares) and IBOSS (dotdash line with plus signs) methods for the six cases listed in Sect. 5.1. The models are selected via Lasso

In accordance with Sect. 5.1, we again compare our method with uniform subsampling, leverage score subsampling, and IBOSS on the six cases. The selection accuracies for the six cases are presented in Figure 5, and the log MSPEs are provided in Figure 6 to evaluate prediction performance.

From Figures 5 and 6, one can see that the forward regression results based on BIC are very similar to those of the all-subset regression based on BIC.

Additional simulation results on Lasso

Now we examine our method's performance when model selection is carried out via the Lasso (Tibshirani 1996).

To align with the settings described in Sect. 5.1, we again compare our method with the other three methods (i.e., uniform subsampling, leverage score subsampling, and IBOSS) on the six cases listed at the beginning of Sect. 5.1, and evaluate the selection performance through selection accuracies and MSPEs. The Lasso is implemented with the glmnet package (Simon et al. 2011), and the tuning parameter is selected by 10-fold cross-validation via the cv.glmnet() function. For leverage score subsampling, the Lasso is conducted as in Leng and Leung (2011).
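For reference, a minimal sketch of this Lasso step is given below. The subdata objects are simulated placeholders, and the choice of lambda.min is an illustrative assumption; the text above only specifies that the tuning parameter is chosen by 10-fold cross-validation with cv.glmnet().

```r
## Lasso on the selected subdata via glmnet, with lambda tuned by 10-fold CV.
## X_sub / y_sub stand for the subdata returned by a selection method; they are
## simulated here only so that the snippet runs on its own.
library(glmnet)
set.seed(4)
r <- 500; p <- 20
X_sub <- matrix(rnorm(r * p), r, p, dimnames = list(NULL, paste0("x", 1:p)))
y_sub <- drop(X_sub[, 1:3] %*% c(2, -1, 1) + rnorm(r))

cv_fit  <- cv.glmnet(X_sub, y_sub, alpha = 1, nfolds = 10)  # alpha = 1: Lasso
coef_cv <- coef(cv_fit, s = "lambda.min")                   # CV-chosen lambda (illustrative choice)
nonzero <- rownames(coef_cv)[as.vector(coef_cv) != 0]
setdiff(nonzero, "(Intercept)")                             # selected variables
```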

Results on the selection accuracies are presented in Figure 7. It can be seen that the selection results based on the Lasso are very similar to those of the all-subset selection based on BIC.

To see the benefits of model selection, we also report the log MSPEs in Figure 8 with \(n_\mathrm{test}=500.\) From Figure 8, we can clearly see that IBOSS and our method (LEVSS) uniformly perform better than the uniform subsampling method.

Fig. 8 Log MSPEs for different r based on the UNIF (solid line with circles), LEVSS (dashed line with triangles), LEV (dotted line with squares) and IBOSS (dotdash line with plus signs) methods for the six cases listed in Sect. 5.1. The models are selected via Lasso


About this article


Cite this article

Yu, J., Wang, H. Subdata selection algorithm for linear model discrimination. Stat Papers 63, 1883–1906 (2022). https://doi.org/10.1007/s00362-022-01299-8

