1 Introduction

Regression with high-dimensional data is nowadays a routine task in a large variety of application areas, ranging from genomic analysis and medicine to finance and industrial processes. Perhaps the most popular method is the Lasso (Tibshirani 1996), which, given a response \(Y \in {\mathbb {R}}^n\) and a matrix of predictors \(X \in {\mathbb {R}}^{n \times p}\), solves

$$\begin{aligned} {{\,\mathrm{arg\,min}\,}}_{b \in {\mathbb {R}}^p} \left\{ \frac{1}{2n} \Vert Y - X b\Vert _2^2 + \lambda \Vert b\Vert _1 \right\} . \end{aligned}$$
(1)

Whilst one of the goals of the Lasso and the many related penalised regression procedures is to determine which variables are relevant, in many settings some vague prior information on the relative importance of the variables may be available.

One instance of this concerns the scaling of variables so they have the same empirical variance before performing the Lasso optimisation (1). This practice is very common and is carried out by default in many software packages including the highly popular glmnet (Friedman et al. 2010). The rationale for this is to ensure that all of the coefficients are treated in a balanced way; otherwise, coefficients corresponding to variables with large empirical variances will effectively experience very little shrinkage whereas those corresponding to variables with small variances will be penalised heavily. On the other hand, any information that may be encoded in the scale of the columns of X is lost. For example, in a setting where one may expect measurement error to be distributed evenly over the variables, it is reasonable to suspect that variables with larger observed variance will be less corrupted by the error and hence contain more underlying signal. Perhaps as a result of this, it is also common to remove columns with the smallest variance as part of pre-processing, a step known in the machine learning community as applying a ‘low variance filter’ (see for example Silipo et al. (2014), Singh et al. (2017), Abou Elhamayed (2018), Langkun et al. (2020), Saputra et al. (2018), Kalambe et al. (2020)). This is however a somewhat crude way of using this potential information, and may eliminate important variables that happen to have a small variation, a risk which the practice of scaling variables aims to mitigate.

Another example of potential prior information being available concerns regression on lagged time series data when estimating autoregressive processes. It is natural to assume that the importance of the historical data typically decreases with increasing lag, though seasonality considerations may also be incorporated into a final ordering among the predictors. An a priori ordering or partial ordering on variables could arise for many other reasons, and in this paper we will focus on the more general question of how such information can be used to improve the final fit.

One approach to incorporating this information involves modifying the Lasso penalty term to allow an individual tuning parameter \(\lambda _j\) for each variable, giving the penalty \(\sum _{j=1}^p \lambda _j | \beta _j|\), the idea being to place a smaller penalty on those coefficients believed to be more important. Manual reweighting of the penalty terms is computationally very attractive since it is no more difficult to compute than the Lasso (1), for which there exist very fast and reliable algorithms that include this functionality (Friedman et al. 2010). Nardi and Rinaldo (2011) study this approach in the autoregressive model setting. Motivated primarily by the time-lagged regression example, Tibshirani and Suo (2016) propose an alternative approach involving fitting the Lasso with a monotonicity constraint imposed on the coefficients with respect to a given ordering.

Whilst these approaches can take great advantage of an informative ordering, one potential drawback is that when the ordering is uninformative, their performance can suffer substantially. Indeed, the numerical results in Tibshirani and Suo (2016) suggest that the practitioner pays a large price for supplying a randomised ordering. Micchelli et al. (2010) consider a general norm-based penalty framework that includes, as an example, a penalty applied to groups (Yuan and Lin 2006) which respect a specified ordering but are data-dependent. Such an approach, while less aggressive than a monotonicity constraint on the coefficients or a reweighting of the penalty, can still suffer if the ordering is unhelpful. In addition, both this and the approach of Tibshirani and Suo (2016) are much more computationally involved than a regular Lasso regression, and less suited to the sort of large-scale settings we have in mind here.

In this work we propose to incorporate a potentially useful ordering over the variables in the following simple way, outlined in more detail in Section 2. We begin by fitting a model over the full set of variables and then proceed by fitting a sequence of nested submodels, each time removing a subset consisting of the least important variables according to our ordering. A final model is then selected by cross-validation. We show that when using ridge regression, by arranging computations appropriately, the cost for performing regressions on all p nested subsets of variables given by the ordering is the same as that of a single fit on the full set of variables. We also describe a strategy for speeding up computations when the Lasso is used.

The statistical performance of our nested regression approach depends on cross-validation being able to select a good model from among the candidates given by the ordering; when the ordering is uninformative, the selected model will most likely be close to that obtained from regressing on the full set of variables, whereas for an informative ordering it could come from a regression on a much smaller set of variables. In Section 3 we give an oracle inequality bounding the performance of the coefficient estimate that performs best on test data from among a sequence of estimates, showing that the number of competing estimates affects the performance only logarithmically. In our context, these estimates will be derived from regressing on different subsets of variables, but the result may be of independent interest.

In Section 4 we present the results of numerical experiments on simulated and real data that demonstrate the utility of our nested regression method. We study settings where order information may arise through knowledge of the degree of measurement error in the covariates, heterogeneous levels of missingness among the variables, and chronological ordering among variables derived from time series. We conclude with a discussion; proofs are deferred to the Appendix. An R package implementing our proposed methodology is available at https://github.com/bgs25/OrderRegression.

2 Methodology

In this section we present our nested regression framework for incorporating potential prior ordering information when fitting high-dimensional regression models. We first present our general approach before describing specific versions for Lasso and ridge regression in high-dimensional linear models.

2.1 Using ordering information

Consider a general regression problem, with response vector \(Y \in {\mathbb {R}}^n\) and matrix of predictors \(X \in {\mathbb {R}}^{n \times p}\), of the form

$$\begin{aligned} {{\,\mathrm{arg\,min}\,}}_{b \in {\mathbb {R}}^p}\left\{ \ell (Y, Xb) + {\mathcal {P}}_\lambda (b)\right\} , \end{aligned}$$
(2)

where \(\ell \) is some loss function and \({\mathcal {P}}_\lambda \) a penalty indexed by a regularisation parameter \(\lambda \). Suppose that we have a reordering \(\pi _1,\ldots ,\pi _p\) of variables \(1,\ldots ,p\) with \(\pi _j\) expected to be at least as important a predictor as \(\pi _k\) when \(k \ge j\). We then specify indices \(p=j_1> j_2> \cdots > j_K=1\) from which we obtain a collection of nested subsets \(S_1 \supset \cdots \supset S_K\) with \(S_k = \{\pi _1,\ldots ,\pi _{j_k}\}\). When \(K \ll p\) we typically choose the \(j_k\) such that the sizes of the subsets decrease exponentially, so \(|S_k|/|S_{k+1}| \approx |S_1|/|S_2|\) for \(k = 1, \ldots , K - 1\), and \(S_K = \{\pi _1\}\). This is partly motivated by the logarithmic cost associated with including extraneous variables in a Lasso regression.
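As an illustration, one way of producing such a geometrically decaying sequence of subset sizes is sketched below in R (the function name is ours and not that of the accompanying package; only p and K are assumed given).

```r
# Choose subset sizes |S_1| > ... > |S_K| decaying (approximately)
# geometrically from p down to 1.
subset_sizes <- function(p, K) {
  sizes <- round(exp(seq(log(p), 0, length.out = K)))
  unique(pmax(sizes, 1))  # guard against duplicates and zeros after rounding
}
subset_sizes(p = 1000, K = 10)
# 1000  464  215  100   46   22   10    5    2    1
```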

Let us introduce the notation \([j]=\{1,\ldots ,j\}\) for \(j \in {\mathbb {N}}\), and for any nonempty \(S \subseteq [p]\), let \(X_S \in {\mathbb {R}}^{n \times |S|}\) be the submatrix of X consisting of those columns indexed by S.

Given a grid of tuning parameters \(\lambda _1> \cdots > \lambda _L\), for each \(k \in [K]\) and \(l \in [L]\) we perform optimisation (2) with \(\lambda = \lambda _l\) and regressing only on those variables indexed by \(S_k\), i.e. with \(X_{S_k}\) in place of X, to give a vector of coefficient estimates \({\hat{\beta }}^{k,l} \in {\mathbb {R}}^p\) with \({\hat{\beta }}^{k,l}_j = 0\) for all \(j \notin S_k\). If our ordering were informative in that for some large k, \(S_k\) contained the set of important variables, regressions performed on \(S_k\) would be unhampered by the potentially large number of unimportant variables in \([p] \setminus S_k\). For example, it may be the case that conditions on the design matrix such as the irrepresentable condition (Zhao and Yu 2006) or the compatibility condition (Van de Geer and Bühlmann 2009) that guarantee the Lasso performs well are met by \(X_{S_k}\) but not the full design matrix X. Our new tuning parameter is the pair \((k, l)\), which we typically select by cross-validation.
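For concreteness, a deliberately naive sketch of this step using glmnet is given below; here `ord` denotes the ordering \(\pi _1,\ldots ,\pi _p\), `sizes` the subset sizes \(j_1,\ldots ,j_K\) and `lambda` the tuning parameter grid, and the function name is illustrative rather than that of our package. The computational shortcuts described in Sections 2.2 and 2.3 replace this brute-force loop.

```r
library(glmnet)

# Fit a Lasso path on each nested subset S_k = {ord[1], ..., ord[sizes[k]]}.
# (glmnet requires at least two columns, so the smallest subsets may need
# special handling in practice.)
fit_nested <- function(X, Y, ord, sizes, lambda) {
  lapply(sizes, function(jk) {
    S_k <- ord[seq_len(jk)]
    glmnet(X[, S_k, drop = FALSE], Y, lambda = lambda)
  })
}
```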

One potential issue with the approach outlined above is that, particularly if K is large, the computational burden of performing the \(K \times L\) regressions may be substantial. We now explain how for both ridge regression and sparsity inducing penalties such as the Lasso, the computation can be organised so the cost is manageable even with large-scale data.

2.2 Application to Lasso regression

We now consider nested regression using the Lasso (1) as our base regression procedure. Similarly to how warm starts greatly speed up the computation of a Lasso solution path compared to separately computing coefficient estimates at each \(\lambda _l\) (Friedman et al. 2010), we can utilise \(\{{\hat{\beta }}^{k,l}\}_{l=1}^L\) to speed up computation of \(\{{\hat{\beta }}^{k+1,l}\}_{l=1}^L\). The key observation is that if for a given \((k, l)\), \({\hat{\beta }}^{k,l}_j = 0\) for all \(j \notin S_{k+1}\), then \({\hat{\beta }}^{k+1,l} = {\hat{\beta }}^{k,l}\), so no further computation is required. The finer the grid (i.e. the larger K is), the more computation can be skipped and thus the greater the relative gains from this strategy (as demonstrated in Fig. 1). This strategy can also be used for other penalised regression approaches that yield sparse solutions, such as the relaxed Lasso (Meinshausen 2007), the adaptive Lasso (Zou 2006) and MCP (Zhang 2010), or regression procedures using sparsity inducing penalty functions with losses other than squared error loss.
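The reuse rule amounts to a simple support check, as in the following sketch (function name ours): if the support of \({\hat{\beta }}^{k,l}\) already lies inside \(S_{k+1}\), the same coefficient vector is the solution for \((k+1, l)\) and need not be recomputed.

```r
# TRUE if the solution for (S_k, lambda_l) can be reused for (S_{k+1}, lambda_l).
can_reuse <- function(beta_kl, S_kplus1) {
  all(which(beta_kl != 0) %in% S_kplus1)
}
```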

In the case of the Lasso, a further speedup may be realised by recognising that the bulk of the computation in the solution path occurs when computing the tail of the path, i.e., those solutions with small values of the tuning parameter. In many settings, such solutions will not be needed as they will not be selected by cross-validation. The square-root Lasso offers an approach to eliminate these solutions without needing to compute them. The square-root Lasso (Belloni et al. 2011; Sun and Zhang 2012) is given by

$$\begin{aligned} {\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}}) \in {{\,\mathrm{arg\,min}\,}}_{b \in {\mathbb {R}}^p} \left\{ \Vert Y - X b\Vert _2/\sqrt{n} + \lambda _{\mathrm {sq}} \Vert b\Vert _1\right\} . \end{aligned}$$
(3)

By comparing the KKT conditions of the optimisation above and those of the Lasso, it may be shown that the solution paths of the Lasso (1) and the square-root Lasso are identical, but are parametrised differently. Specifically, writing \({\hat{\beta }}(\lambda )\) for a Lasso solution (1) with tuning parameter \(\lambda \) and \({\hat{\sigma }}(b) := \Vert Y - X b\Vert _2 / \sqrt{n}\) for \(b \in {\mathbb {R}}^p\), we have that \({\hat{\beta }}(\lambda )\) is in fact a square-root Lasso solution with tuning parameter \(\lambda _{\mathrm {sq}} = \lambda / {\hat{\sigma }}({\hat{\beta }}(\lambda ))\), provided \({\hat{\sigma }}({\hat{\beta }}(\lambda )) > 0\). Conversely, any square-root Lasso solution \({\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}})\) is also a Lasso solution \({\hat{\beta }}(\lambda )\) with \(\lambda = \lambda _{\mathrm {sq}} {\hat{\sigma }}({\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}}))\), provided \({\hat{\sigma }}({\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}}))>0\). Furthermore, \(\lambda \mapsto \lambda / {\hat{\sigma }}({\hat{\beta }}(\lambda ))\) is a well-defined non-decreasing function; see Shah and Bühlmann (2019, App. B) for a derivation.

One key advantage of the square-root Lasso is that there exist theoretically rate-optimal choices of the tuning parameter, such as \(1.1\sqrt{2\log (p)/n}\), that do not depend on the unknown variance of the noise. In practice, such theoretically motivated choices tend to be conservative in that they are too large, and are typically out-performed by cross-validation, at least from the perspective of prediction error. They can nevertheless be helpful as a criterion for terminating the computation of the Lasso solution path at a \(\lambda _l > \lambda _L\). Writing \(\lambda (\lambda _{\mathrm {sq}}) = \lambda _{\mathrm {sq}} {\hat{\sigma }}({\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}}))\) with \({\hat{\sigma }}({\hat{\beta }}_{\mathrm {sq}}(\lambda _{\mathrm {sq}}))\) taken to be maximal, for instance, if we take \(\lambda _{\mathrm {sq}}\) to be, say, half of a theoretically optimal choice of tuning parameter (in our experiments we used 0.5 times the choice suggested in Sun and Zhang (2013)), cross-validation is unlikely to select any \(\lambda _l < \lambda (\lambda _{\mathrm {sq}})\), and so such solutions need not be computed.
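A sketch of the resulting truncation of the tuning parameter grid is given below; for simplicity it uses the generic choice \(\sqrt{2\log (p)/n}\) scaled by 0.5 in place of the Sun and Zhang (2013) choice used in our experiments, and takes the residual scale of the null model as the maximal \({\hat{\sigma }}\) (assuming a centred response).

```r
# Drop lambda values below the Lasso-scale equivalent of a conservative
# square-root Lasso tuning parameter; these solutions are never computed.
truncate_grid <- function(lambda_grid, Y, p) {
  n <- length(Y)
  lambda_sq <- 0.5 * sqrt(2 * log(p) / n)  # scaled-down theoretical choice
  sigma_max <- sqrt(mean(Y^2))             # sigma-hat of the null model (centred Y)
  lambda_grid[lambda_grid >= lambda_sq * sigma_max]
}
```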

Applying the two measures outlined above yields a substantial reduction in the effort required to compute a full set of solution paths across all nested subsets, as shown in Fig. 1. Algorithm 1 below describes how our nested regression approach is implemented. The sets \(A(k, l)\) record which variables have been included in the solution path for variable set \(S_k\) at \(\lambda = \lambda _l\).

Algorithm 1

Fig. 1 Computation time of Algorithm 1 as a function of the number of subsets K when applied to the Riboflavin dataset (Dezeure et al. 2015), where \(n = 71\) and \(p = 4088\)

2.3 Application to ridge regression

We now turn to the case where ridge regression (Hoerl and Kennard 1970) is the base procedure in our nested regression approach. Given a tuning parameter \(\lambda > 0\), the solution on the full set of variables is given by

$$\begin{aligned} (X^TX + \lambda I )^{-1} X^T Y = X^T (XX^T + \lambda I)^{-1} Y, \end{aligned}$$
(4)

with the right-hand side computable in \(O(n^2p)\) operations in the case \(p \gg n\). We will consider the specific case where \(K=p\), and so without loss of generality, we may assume \(S_k = \{k,\ldots ,p\}\). We will show that for a fixed \(\lambda \) and test set \(Z \in {\mathbb {R}}^{n' \times p}\), computation of the full collection of test set predictions for ridge regression solutions across all \(S_1,\ldots ,S_p\) also requires only \(O(n^2p)\) operations, provided \(n' \le n\) (which we will assume from now on). Applying this across a grid of L tuning parameter values then gives a total cost of \(O(Ln^2p)\) for determining the \((k, l)\) pair by cross-validation, when the number of folds is constant.
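For reference, the right-hand side of (4) can be computed directly as in the following short sketch (the function name is ours).

```r
# Kernel ("dual") form of the ridge solution: X^T (X X^T + lambda I)^{-1} Y,
# costing O(n^2 p) operations when p >= n.
ridge_dual <- function(X, Y, lambda) {
  n <- nrow(X)
  crossprod(X, solve(tcrossprod(X) + lambda * diag(n), Y))
}
```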

To achieve the \(O(n^2p)\) cost across all subsets, it will be convenient to compute the predictions corresponding to \(S_p\) first, and then use these to obtain the predictions for \(S_{p-1}\), then \(S_{p-2}\) and so on, as we now explain. Let us fix \(k < p\). The out-of-sample predictions corresponding to (4) applied to the matrix of predictors \(X_{S_{k+1}} := X^{(1)}\) and test data \(Z_{S_{k+1}} := Z^{(1)}\) are given by

$$\begin{aligned} Z^{(1)}(X^{(1)})^T \{X^{(1)}(X^{(1)})^T + \lambda I\}^{-1} Y. \end{aligned}$$
(5)

Let us suppose that \(Z^{(1)}(X^{(1)})^T \in {\mathbb {R}}^{n' \times n}\) and \(A:=(X^{(1)}(X^{(1)})^T + \lambda I)^{-1} \in {\mathbb {R}}^{n\times n}\) have been computed; note these are required in calculating (5), which then needs only an additional \(O(n^2)\) operations to compute. We now explain how the corresponding quantities for \(X^{(2)} := (X_k,X^{(1)})=X_{S_k}\) and \(Z^{(2)} := (Z_k, Z^{(1)})=Z_{S_k}\) may then be retrieved in \(O(n^2)\) steps, where \(X_k \in {\mathbb {R}}^n\) and \(Z_k \in {\mathbb {R}}^{n'}\) are the kth columns of X and Z respectively. First observe that

$$\begin{aligned} Z^{(2)}(X^{(2)})^T = Z^{(1)}(X^{(1)})^T + Z_kX_k^T, \end{aligned}$$

so \(Z^{(2)}(X^{(2)})^T\) may be formed using \(O(nn')\) operations. Next applying the Sherman–Morrison–Woodbury rank one update formula, we see that

$$\begin{aligned}&\{X^{(2)}(X^{(2)})^T + \lambda I\}^{-1} = \{X^{(1)}(X^{(1)})^T + X_kX_k^T + \lambda I\}^{-1} \\&\qquad = \{X^{(1)}(X^{(1)})^T + \lambda I\}^{-1} - \frac{AX_k X_k^T A}{1 + X_k^T A X_k}, \end{aligned}$$

with right-hand side easily obtained from A in \(O(n^2)\) computations. We thus see that the entire set of solutions may be computed using \(O(pn^2)\) operations.
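The update is made explicit in the sketch below (not the packaged implementation): `A` holds \(\{X^{(1)}(X^{(1)})^T + \lambda I\}^{-1}\), `ZX1t` holds \(Z^{(1)}(X^{(1)})^T\), and `Xk`, `Zk` are the new training and test columns.

```r
# Pass from the quantities for S_{k+1} to those for S_k, together with the
# corresponding test-set predictions, in O(n^2) operations.
update_ridge <- function(A, ZX1t, Xk, Zk, Y) {
  AXk     <- A %*% Xk                                     # n x 1
  A_new   <- A - (AXk %*% t(AXk)) / (1 + sum(Xk * AXk))   # Sherman-Morrison
  ZXt_new <- ZX1t + Zk %*% t(Xk)                          # rank-one update of Z X^T
  list(A = A_new, ZXt = ZXt_new, pred = ZXt_new %*% (A_new %*% Y))
}
```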

In practice we compute the full model first using the singular value decomposition of X to obtain the full solution path for this model, with an analogous rank-one-update to that above applied to progress to smaller models.

To illustrate the speed with which this enables the models to be computed, we again used the Riboflavin dataset (see Section 4.2). We trained the models on 50 observations and computed predictions for the remaining 21 across a sequence of 100 tuning parameter values and \(K=p\) variable subsets. Thus, the total number of tuning parameter pairs considered was \(408\,800\); the full computation for all of these on a standard laptop took under 100 seconds. We obtained an estimated mean squared prediction error of 0.318 for the model fitted only on the top 169 variables, compared to 0.399 for the full model.

3 Theory

Our proposed nested regression scheme for incorporating potential prior information in high-dimensional regression involves consideration of a number of different estimators corresponding to different tuning parameter values and subsets of variables. The statistical performance relies on cross-validation being able to select a good estimator from among these. In this section we study this general problem in the context of the high-dimensional linear model,

$$\begin{aligned} Y = X \beta + \varepsilon , \end{aligned}$$
(6)

for a simplified form of cross-validation. Specifically, we suppose that we have available candidate estimators \({\hat{\beta }}^{(1)}, \ldots , {\hat{\beta }}^{(M)} \in {\mathbb {R}}^p\) trained on data \((Y^{\mathrm {tr}}, X^{\mathrm {tr}})\), independent of test data \((Y, X) \in {\mathbb {R}}^n \times {\mathbb {R}}^{n \times p}\), which we will treat as fixed. In the context of the previous section, M would be LK. With these we compute

$$\begin{aligned} {\hat{\beta }} \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{b\in \{ {\hat{\beta }}^{(1)}, \ldots , {\hat{\beta }}^{(M)} \}} \Vert Y - X b \Vert _2^2. \end{aligned}$$
(7)

The candidate estimators could, for example, have been obtained via Lasso or ridge regression; however we consider a more general setting that allows for arbitrary candidate estimators.
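In code, the selection step (7) amounts to nothing more than the following sketch, where `betas` is a list of candidate coefficient vectors.

```r
# Return the candidate with the smallest squared error on the held-out (Y, X).
select_candidate <- function(Y, X, betas) {
  errs <- vapply(betas, function(b) sum((Y - X %*% b)^2), numeric(1))
  betas[[which.min(errs)]]
}
```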

Suppose that the rows of X are i.i.d. with mean zero and covariance \(\Sigma \in {\mathbb {R}}^{p \times p}\). We compare the performance of \({\hat{\beta }}\) to that of the unknown best estimator among \( {\hat{\beta }}^{(1)}, \ldots , {\hat{\beta }}^{(M)}\) defined by

$$\begin{aligned} {\hat{\beta }}^* \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{b \in \{{\hat{\beta }}^{(1)},\ldots ,{\hat{\beta }}^{(M)}\}} (\beta - b)^T\Sigma (\beta -b). \end{aligned}$$
(8)

While a large M can be expected to result in \({\hat{\beta }}^*\) and \(\beta \) being closer, naturally it will result in a larger discrepancy in the performance of \({\hat{\beta }}^*\) and \({\hat{\beta }}\). Theorem 1 however indicates that the price to pay is only logarithmic in M. The implication for our nested regression scheme is that K and L can be chosen to be relatively large subject to computational constraints; while they may not be optimal statistically, they will not be too far off.

Theorem 1

Suppose that \(X = W \Sigma ^{1/2}\) where \(W \in {\mathbb {R}}^{n \times p}\) has independent mean-zero sub-Gaussian entries with variance proxy \(\nu ^2\). Suppose the linear model (6) holds with the components of \(\varepsilon \) mean-zero, independent and sub-Gaussian with variance proxy \(\sigma ^2\). Then for any \(c_1, c_2 >0\) with \(c_1<n / \log M - 1\), we have that with probability at least \(1 - 2M^{-c_1} - 2M^{-c_2}\),

$$\begin{aligned} \Vert \Sigma ^{1/2} ( {\hat{\beta }} - \beta ) \Vert _2 \le \frac{1 + \Psi }{1 - \Psi } \Vert \Sigma ^{1/2}({\hat{\beta }}^* - \beta ) \Vert _2 + \frac{2\sqrt{2}\sigma \sqrt{1 + c_2}}{1 - \Psi } \sqrt{\frac{\log M}{n}}, \end{aligned}$$
(9)

where \(\Psi = 2\sqrt{2}\nu (1 + c_1)^{1/4} \{(\log M) / n \}^{1/4}\).

Note that no sparsity assumptions on \(\beta \) are required, and p can be arbitrarily large compared to n. In addition, no requirements are placed on the quality of the candidate estimators which can be arbitrarily good or bad, though this will be reflected in \({\hat{\beta }}^*\) of course.

In practice we would use cross-validation rather than a single training and test split of the original data as considered in our setup here. However, existing results on cross-validation are not available in the sort of generality we consider here; in particular, they require sparsity assumptions and are tied to particular estimators such as the Lasso (Feng and Yu 2019; Chetverikov et al. 2021).

We remark that another option for comparing the different models output by the procedure could be AIC or BIC, which have the advantage of not requiring test data. However, such information criteria would not have a way of discriminating between the complexity of a model fitted on r variables selected out of a set \(S_k \subset [p]\) and one fitted on r variables selected out of the full set of p variables \(S_1\). On the other hand, cross-validation implicitly provides a way of doing this: the model selected out of the full set of variables may have a greater degree of over-fitting, which would be exposed by its performance on the test set.

4 Numerical experiments

In this section we explore the properties of our nested regression approach in a range of scenarios, using both simulated and real data. In Section 4.1 we consider different levels of ‘informativeness’ in the orderings, and the effect that this has on the prediction error of the final model. Section 4.2 explores the effect of varying the number of subsets K using two real datasets. Sections 4.3 and 4.4 again use real data, this time exploring how our approach can be used with missing or corrupted data. Lastly, in Section 4.5 we run our method on a dataset to predict avocado prices, illustrating the use of our approach in a time series context. For our nested regression approach, we use the Lasso with 5-fold cross-validation and the default grid of 100 tuning parameters as chosen by glmnet on the full model \(S_1\).

Fig. 2 Mean squared prediction errors (MSPE) of our nested regression method as the logarithm of the probability ratio \(\eta \) varies: a value of 0 corresponds to a neutral ordering, more negative values to increasingly adversarial orderings and more positive values to increasingly informative orderings. Left–Right: 5, 10, 25 signal variables; Top–Bottom: 0.5, 1.5, U[0, 2] signal coefficients. Colours for settings are 1. black, 2. red, 3. blue, 4. green. The dotted lines correspond to the errors achieved by the standard approach on the same data. (Color figure online)

4.1 Quality of ordering

In order to see the effect of different variable orderings on the performance of the model, we sample orderings weighted by a vector \(\rho \in [0,1]^p\) of probabilities using the sample function in R (R Core Team 2021) supplied with \(\rho \) as the prob argument. This uses so-called size-biased sampling as described in Pitman and Tran (2015). For example, a neutral (or uninformative) ordering would have this vector as \(( 1/p, \ldots , 1/p)\) (where p is the number of variables), as all permutations are equally likely.

After a model has been constructed, with a true support S, we specify the jth entry of \(\rho \) for \(j = 1, \ldots , p\) by

$$\begin{aligned} \rho _j = {\left\{ \begin{array}{ll} \eta / (p + (\eta -1)|S|) &{}\text { if } j \in S \\ 1 / (p + (\eta -1)|S|) &{}\text { otherwise, } \end{array}\right. } \end{aligned}$$

where \(\eta > 0\) is the ‘probability ratio’. A choice of \(\eta > 1\) means that the ordering is more likely to favour signal variables (meaning the ordering is likely to be useful), whereas \(\eta <1\) means the ordering will prefer non-signal variables (meaning that the ordering will be actively unhelpful). The vector \(\rho \) is used as the weight vector for sampling a permutation which is then used as the ordering.
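A sketch of this sampling step is given below (the function name is ours); `S` denotes the index set of signal variables.

```r
# Draw one ordering of 1:p, with signal variables upweighted by a factor eta.
sample_ordering <- function(p, S, eta) {
  rho <- rep(1 / (p + (eta - 1) * length(S)), p)
  rho[S] <- eta / (p + (eta - 1) * length(S))
  sample(seq_len(p), size = p, replace = FALSE, prob = rho)  # size-biased sampling
}
```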

We consider a linear model (6) with design matrices sampled with \(n = 100\) and \(p = 1000\), with i.i.d. mean-zero Gaussian rows with covariance matrix \(\Sigma \). Tests were run with four different choices of \(\Sigma \):

  1.
  2. \(\Sigma _{jk} = 0.9^{|j - k|}\)
  3. \((\Sigma ^{-1})_{jk} = 0.4^{|j - k| / 5}\) (\(\approx 0.833^{|j - k|}\))
  4.

Three different sparsities for \(\beta \) were used: 5, 10, and 25 variables, with the variables with non-zero coefficients selected uniformly at random. Three regimes for populating the non-zero entries in \(\beta \) were used: two constant (0.5 and 1.5), and one random, where coefficients are drawn as independent U(0, 2) random variables. We fixed the number of subsets K to be 100; in these examples, the results were relatively insensitive to the choice of K provided it was sufficiently large, a phenomenon expected to hold more generally given the result of Theorem 1. The tests were repeated 500 times with glmnet used to perform the regressions in the case with \(K=1\).

It is interesting to note in Fig. 2 that an adversarially bad ordering does not give rise to any worse performance than a neutral one. There is reason to believe that an adversarially bad ordering may in fact be preferable to a neutral one: in a setting where an ordering is not actively helpful (i.e. it is either neutral or adversarially poor), we wish for our procedure to select the full model \(S_1\). If the ordering is sufficiently bad that the increase in loss for the submodels is larger than the variance of the test error, there is a greater chance that \(S_1\) will be selected.

Fig. 3 Average prediction errors for order regressions on riboflavin (left) and prostate (right) datasets for different K, with squared error and misclassification losses respectively. The error bar is \(\pm 2\) standard deviations and the dotted line depicts the prediction error of a standard \(\ell _1\)-penalised regression

4.2 Riboflavin and Prostate data

Tests were performed on the riboflavin dataset (available in R package hdi (Dezeure et al. 2015); \(n=71\), \(p=4088\)) and the prostate dataset (available in R package spls (Chun and Keleş 2010); \(n=102\), \(p=6033\)) to explore the performance improvements attained by using our nested regression approach. With the former dataset, which has a continuous response, we used a Lasso regression with squared error loss, and with the latter, which has a binary response, we used an \(\ell _1\)-penalised logistic regression. Prediction error was estimated by 5-fold cross-validation error; within each of the folds cross-validation was also used to select the model. A range of values of K, shown by the ticks on the plots in Fig. 3, was used for each dataset. The ordering used for both of these was the one induced by the scales of the columns in the matrices of predictors. As mentioned in the introduction, this order is often used more directly in a low variance filtering step, particularly with gene expression data (Singh et al. 2017).
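For reference, this variance-induced ordering amounts to the following one-liner, applied to the (unscaled) matrix of predictors.

```r
# Order the columns of X from largest to smallest empirical variance.
variance_ordering <- function(X) order(apply(X, 2, var), decreasing = TRUE)
```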

We see in both of these examples that the prediction error improves when using a nested regression approach instead of ordinary Lasso models (equivalent to taking \(K= 1\)). With both the riboflavin data and the prostate data, the error then increases slowly as K increases. This behaviour is to be expected from (9), where the first term on the right-hand side is typically decreasing in \(M=LK\), whereas the final term is increasing in M.

Only 8 values of K were used for the prostate data due to computational constraints, as for logistic regression models we used a simple loop over \(\ell _1\)-penalised logistic regressions rather than the approach described in Section 2.2. Each test was repeated 2000 times, with different splits used in the cross-validation schemes in each run.

4.3 Corrupted data

Here we consider a setting where an ordering among the variables arises through corruption of the entries of the design matrix. We used the ‘muscle-skeletal’ dataset from the GTEx project, which has 491 rows and \(14\,713\) columns, as our uncorrupted dataset. We preprocessed the data as in Shah et al. (2020) by removing the effect of measured and estimated confounders. We took as a response variable a column randomly selected (anew in each run) from the matrix, meaning that for our experiment \(n = 491\) and \(p = 14\, 712\).

We then artificially corrupted the data by independently replacing, for each j, each entry in the jth column of the matrix of predictors with probability \(\rho _j\) by an independent standard Gaussian random variable. In order to obtain an ordering we assume knowledge of the ranking of the \(\rho _j\). In practice, if this is not known then an estimate can still be useful: even if a practitioner has only a very vague notion of which variables are more likely to be corrupted, we see in Section 4.1 that it can still be beneficial to use such an ordering.

Fig. 4 Proportion of variance explained (larger is better) by the Lasso and our nested regression scheme incorporating ordering information for each of the corruption regimes

We tested performance in four settings, each with a different vector \(\rho \) controlling the corruption probabilities of the variables. We constructed \(\rho _1, \ldots , \rho _p\) for each of the four settings as follows (before randomising the order of \(\rho _1,\ldots , \rho _p\)):

  1. \(\rho _j = 0.5\) for \(j = 1, \ldots , \left\lfloor 0.2 p \right\rfloor \) and 0 otherwise
  2. \(\rho _j = 0.5\) for \(j = 1, \ldots , \left\lfloor 0.5 p \right\rfloor \) and 0 otherwise
  3. \(\rho _j = 0.5\) for \(j = 1, \ldots , \left\lfloor 0.8 p \right\rfloor \) and 0 otherwise
  4. \(\rho _j = \min \{ 0.95,(j-1)/ p \}\)

The data were split into five folds; for each of these a model was fitted on the complementary four with entries corrupted according to the settings above. These models were themselves tuned using five-fold cross-validation. We set \(K=25\) for all experiments, which were repeated 2000 times.

4.4 Heterogeneous missing data

Here we consider the missing data setting, where in a given design matrix X, each entry \(X_{ij}\) is missing independently with probability \(\rho _j\). In contrast to the previous section, here it is observed exactly which entries are missing. This means that the ranking of the variables by their overall missingness is known, with no additional knowledge assumed.

Data are missing homogeneously in the case where \(\rho _j \equiv \rho \), i.e. the entries in X are all missing with equal probability. In this case, the probability \(\rho \) can be estimated and there are well-studied methods for computing Lasso solutions, such as those discussed in Loh et al. (2012). However, the setting we consider here includes heterogeneous missing data. Within high-dimensional statistics there are methods that accommodate heterogeneous missingness in principal components analysis (Zhu et al. 2019) and in regression problems (Rosenbaum et al. 2013; Datta et al. 2017).

In order to implement our approach in this setting, we first observe that the Lasso objective can be written as

$$\begin{aligned} \frac{1}{2n} Y^T Y - \frac{1}{n} b^T X^T Y + \frac{1}{2n}b^T X^T X b + \lambda \Vert b \Vert _1 , \end{aligned}$$
(10)

which depends on X only through the vector \(X^T Y /n\) and matrix \(X^T X /n\). Indeed, this is the starting point of existing approaches for performing regression on high-dimensional data with missing entries (Loh et al. 2012; Rosenbaum et al. 2013; Datta et al. 2017). In the case where X has some missing entries, the above quantities can be estimated by

$$\begin{aligned}&{\hat{\Gamma }}_{jk} = {\tilde{X}}_j^T {\tilde{X}}_k / |\{ i : X_{ij} \text { and } X_{ik} \text { not missing}\}| \end{aligned}$$
(11)
$$\begin{aligned}&{\hat{\gamma }}_{j} = {\tilde{X}}_j^TY / |\{i : X_{ij}\text { not missing}\}|, \end{aligned}$$
(12)

where \({\tilde{X}}_{ij} = X_{ij}\) if \(X_{ij}\) not missing, and 0 otherwise. These quantities can be substituted into the update steps in computing solutions to the following surrogate objective function

$$\begin{aligned} {\hat{\beta }} \in {{\,\mathrm{arg\,min}\,}}_{b \in {\mathbb {R}}^p} \left\{ - b^T{\hat{\gamma }} + \frac{1}{2} b^T {\hat{\Gamma }}b + \lambda \Vert b \Vert _1\right\} . \end{aligned}$$
(13)

Note that \({\hat{\Gamma }}\) is not in general positive semidefinite, which would be needed in order for (13) to have a finite minimum. To overcome this difficulty, Datta et al. (2017) suggest projecting \({\hat{\Gamma }}\) onto the cone of positive semi-definite matrices via \({{\,\mathrm{arg\,min}\,}}_{\Gamma \in {\mathbf {S}}^p_{+}}\Vert \Gamma - {\hat{\Gamma }}\Vert _{\infty }\), which restores the convexity of the problem. Here we will use a simpler approach, replacing \({\hat{\Gamma }}\) with \({\hat{\Gamma }}_{\text {psd}} := {\hat{\Gamma }} + |\Lambda _{\text {min}}({\hat{\Gamma }})|I_p\), where \(\Lambda _{\text {min}}({\hat{\Gamma }})\) is the smallest eigenvalue of \({\hat{\Gamma }}\) (negative whenever \({\hat{\Gamma }}\) is not positive semidefinite). This is equivalent to the addition of a fixed ridge penalty term in the objective function

$$\begin{aligned} {\hat{\beta }}&\in {{\,\mathrm{arg\,min}\,}}_{\beta \in {\mathbb {R}}^p} \left\{ - \beta ^T{\hat{\gamma }} + \frac{1}{2} \beta ^T {\hat{\Gamma }}_{\text {psd}} \beta + \lambda \Vert \beta \Vert _1 \right\} \nonumber \\&= {{\,\mathrm{arg\,min}\,}}_{\beta \in {\mathbb {R}}^p} \left\{ - \beta ^T{\hat{\gamma }} + \frac{1}{2} \beta ^T {\hat{\Gamma }} \beta + \lambda \Vert \beta \Vert _1 + \frac{1}{2}|\Lambda _{\text {min}}({\hat{\Gamma }})|\Vert \beta \Vert _2^2 \right\} . \end{aligned}$$
(14)
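A minimal sketch of these computations is given below, assuming missing entries are coded as NA; the function name is illustrative and edge cases (e.g. column pairs with no jointly observed rows) are ignored.

```r
# Pairwise-complete estimates (11)-(12) and the eigenvalue shift giving Gamma_psd.
missing_lasso_inputs <- function(X, Y) {
  obs   <- !is.na(X)                        # indicators of observed entries
  Xt    <- X; Xt[!obs] <- 0                 # tilde-X: missing entries set to zero
  Gamma <- crossprod(Xt) / crossprod(obs)   # (11); denominator counts joint observations
  gamma <- drop(crossprod(Xt, Y)) / colSums(obs)  # (12)
  lam_min <- min(eigen(Gamma, symmetric = TRUE, only.values = TRUE)$values)
  # shift by |Lambda_min| as in (14); only strictly needed when Lambda_min < 0
  list(Gamma_psd = Gamma + abs(lam_min) * diag(ncol(X)), gamma = gamma)
}
```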

Tests were performed on the same ‘muscle-skeletal’ dataset as in Section 4.3, again randomly selecting a variable in each replicate to use as the response. Three missing data regimes were used, each determined by a vector \(\rho \in [0,1]^{p}\) specifying the independent missingness probability of each variable:

  1. \(\rho _j = 0.25\) for all j,
  2. \(\rho _j = (j-1)/(3p)\), then randomising the order of \(\rho \),
  3. \(\rho _j = 0.3\) for \(j = 1, \ldots , \left\lfloor 0.5p \right\rfloor \) and 0 otherwise, then randomising the order of \(\rho \).

Fig. 5 Proportion of variance explained (larger is better) by the Lasso and our nested regression scheme incorporating ordering information for each of the missing data regimes

The data were split into five folds; for each of these a model was fitted on the complementary four with missing entries according to the settings described. These models were themselves tuned using five-fold cross-validation, with scores computed using the estimates \({\hat{\Gamma }}\) and \({\hat{\gamma }}\). Experiments were repeated 250 times.

Figure 5 displays the proportion of variance explained by both the vanilla Lasso and our approach with \(K=25\) in each of the above settings. As expected, there is no improvement in Setting 1 from using a nested regression approach, as the missingness is homogeneous so the ordering will be uniformly random. In the other two settings, using our approach with the knowledge of which variables are missing more frequently allows us to fit models that provide better predictions.

4.5 Avocado data

Here we study historical data on avocado prices and sales volume in multiple US markets available on Kaggle [Kiggins]. For this experiment we consider predicting the price using an autoregressive model, and use the 53 markets for which full weekly price data is available for both ‘conventional’ and ‘organic’ varieties from the beginning of January 2015 to the end of March 2018.

Fig. 6 Mean squared prediction error (MSPE) across the models for our nested regressions with varying K, with the leftmost boxplot in each panel corresponding to the special case of the regular Lasso. Left and centre plots show the differences between the conventional and organic time series, the right plot includes results for both varieties

A design matrix was compiled using all 106 time series, using the previous 52 values of each one, thus resulting in a 5512 (\( = 52 \times 106\))-dimensional model, with 117 observations. For each avocado variety and market, a model was fitted on the first 78 weeks of data and then tested on the remaining 39 weeks to assess performance. Unlike in the other experiments where the model was tuned using cross-validation, here we wish to respect the chronological ordering of the observations. We therefore train on the first 39 observations, then validate on the next 39 in order to select the model, falling within the set-up of Theorem 1. Once our model is selected we then retrain it on all 78 of the training observations before testing on the hitherto-unseen test set of the last 39 observations.

The ordering of variables we used with our nested regression scheme was motivated by the following considerations. For a univariate time series with no seasonal effects, we would typically order the variables by ranking them from most to least recent (with most recent being the most ‘important’). Here there are effectively 106 time series which are observed weekly, so there may be some seasonal effect.

The ordering used here was constructed by first splitting the time series into groups of decreasing ‘importance’ as follows:

  1. the particular time series that we are modelling;
  2. the complementary variety to the time series we are modelling (e.g. were we modelling Albany-conventional, here we would take Albany-organic);
  3. everything else.

Within each of these groups, the reading from 52 weeks (one year) previous was first in the ordering, with the rest ordered from newest to oldest.
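A sketch of this construction is given below, under the hypothetical assumption that the columns of the design matrix are arranged in series-major blocks of 52 lags, with lag 1 the most recent; the names are ours.

```r
# Ordering for one target series: own lags first, then the complementary
# variety, then all other series; within each block lag 52 comes first,
# followed by lags 1 to 51 (newest to oldest).
build_ordering <- function(target, complement, n_series = 106, n_lags = 52) {
  lag_order <- c(n_lags, seq_len(n_lags - 1))
  block  <- function(s) (s - 1) * n_lags + lag_order
  others <- setdiff(seq_len(n_series), c(target, complement))
  c(block(target), block(complement), unlist(lapply(others, block)))
}
```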

Each of the response vectors was scaled to have unit variance. Figure 6 contains the mean squared prediction error for each time series, fitted with vanilla Lasso models as well as with our approach with \(K= 10\) and \(K=100\). In both cases, using our approach substantially improves the quality of the predictions, with \(K=10\) giving the largest improvement. This illustrates the flexibility of our approach with respect to the origin of the ordering over the variables, and how it can improve prediction performance in a range of scenarios. The median times for the 106 regressions were 1.32s for \(K=1\), 1.53s for \(K=10\) and 3.46s for \(K=100\).

5 Discussion

In this work we have introduced a simple and general nested regression approach for using potential prior information on the importance of variables in a regression problem. Such prior information may be available in a variety of settings of interest, including time series analysis, settings with missing data, and settings where some corruption of the data is suspected, among many others. We have described a simple computational strategy for implementing our approach with the Lasso and ridge regression, and provided in Theorem 1 some theoretical support for our scheme via a general result on selecting estimators in high-dimensional linear regression settings. It would be of interest to extend this result to encompass generalised linear models, for example, and also aggregation schemes such as stacking (Wolpert 1992). Another open question is whether there is a data-dependent choice of the number of subsets K that can be used.