Abstract
We investigate problems in penalized M-estimation, inspired by applications in machine learning debugging. Data are collected from two pools, one containing data with possibly contaminated labels, and the other known to contain only cleanly labeled points. We first formulate a general statistical algorithm for identifying buggy points and provide rigorous theoretical guarantees when the data follow a linear model. We then propose an algorithm for tuning parameter selection of our Lasso-based method, with theoretical guarantees. Next, we consider a two-player “game” played between a bug generator and a debugger, where the debugger can augment the contaminated data set with cleanly labeled versions of points in the original data pool, and we develop and analyze a debugging strategy based on a mixed integer linear program (MILP). Finally, we provide empirical results verifying our theory and the utility of the MILP strategy.
1 Introduction
Modern machine learning systems are extremely sensitive to training set contamination. Since sources of error and noise are unavoidable in real-world data (e.g., due to Mechanical Turkers, selection bias, or adversarial attacks), an urgent need has arisen to perform automatic debugging of large data sets. Cadamuro et al. (2016), Zhang et al. (2018) proposed a method called “machine learning debugging” to identify training set errors by introducing new clean data. Consider the following real-world scenario: Company A collects movie ratings for users on a media platform, from which it learns relationships between features of movies and ratings in order to perform future recommendations. A competing company B knows A’s learning method and hires some users to provide malicious ratings. Company A could employ a robust method for learning from contaminated data—but in the long run, it would be more effective for company A to identify the adversarial users and prevent them from submitting additional buggy ratings in the future. This distinguishes debugging from classical learning. The debugging problem also assumes that company A can hire an expert to help rate movies, thereby obtaining a second, trusted data set that is generally smaller than the original data set due to budget limitations. In this paper, we will study a theoretical framework for the machine learning debugging problem in a linear regression setting, where the main goal is to identify bugs in the data. We will also discuss theory and algorithms for selecting the trusted data set.
Our first contribution is to provide a rigorous theoretical framework explaining how to identify errors in the “buggy” data pool. Specifically, we embed a squared loss term applied to the trusted data pool into the extended Lasso algorithm proposed by Nguyen and Tran (2013), and reformulate the objective to better service the debugging task. Borrowing techniques from robust statistics (Huber and Ronchetti 2011; She and Owen 2011; Nguyen and Tran 2013; Foygel and Mackey 2014; Slawski and Ben-David 2017) and leveraging results on support recovery analysis (Wainwright 2009; Meinshausen and Yu 2009), we provide sufficient conditions for successful debugging in linear regression. We emphasize that our setting, involving data coming from multiple pools, has not been studied in any of the earlier papers.
The work of Nguyen and Tran (2013), Foygel and Mackey (2014) [and more recently, Sasai and Fujisawa (2020)] provided results for the extended Lasso with a theoretically optimal choice of tuning parameter, which depends on the unknown noise variance in the linear model. Our second contribution is to discuss a rigorous procedure for tuning parameter selection which does not require such an assumption. Specifically, our algorithm starts from a sufficiently large initial tuning parameter that produces the all-zeros vector as an estimator. Assuming the sufficient conditions for successful support recovery are met, this tuning parameter selection algorithm is guaranteed to terminate with a correct choice of tuning parameter after a logarithmic number of steps. Note that when outliers exist in the training data set, it is improper to use cross-validation to select the tuning parameter due to possible outliers in the validation data set.
Our third contribution considers how to design a second clean data pool, which is an important but previously unstudied problem in machine learning debugging. We consider a two-player “game” between a bug generator and debugger, where the bug generator performs adversarial attacks (Chakraborty et al. 2018), and the debugger applies Lasso-based linear regression to the augmented data set. On the theoretical side, we establish a sufficient condition under which the debugger can always beat the bug generator, and show how to translate this condition into a debugging strategy based on mixed integer linear programming. Our theory is only derived in the “noiseless” setting; nonetheless, empirical simulations show that our debugging strategy also performs well in the noisy setting. We experimentally compare our method to two other algorithms motivated by the machine learning literature, which involve designing two neural networks, one to correct labels and one to fit cleaned data (Veit et al. 2017); and a method based on semi-supervised learning that weights the noisy and clean datasets differently and employs a similarity matrix based on the graph Laplacian (Fergus et al. 2009).
The remainder of the paper is organized as follows: Sect. 2 introduces our novel framework for machine learning debugging using weighted M-estimators. Section 3 provides theoretical guarantees for recovery of buggy data points. Section 4 presents our algorithm for tuning parameter selection and corresponding theoretical guarantees. Section 5 discusses strategies for designing the second pool. Section 6 provides experimental results. Section 7 concludes the paper.
Notation We write \(\varLambda _{\min }(A)\) and \(\varLambda _{\max }(A)\) to denote the minimum and maximum eigenvalues, respectively, of a matrix A. We use Null(A) to denote the nullspace of A. For subsets of row and column indices S and T, we write \(A_{S,T}\) to denote the corresponding submatrix of A. We write \(\Vert A\Vert _{\max }\) to denote the elementwise \(\ell _\infty\)-norm, \(\Vert A\Vert _2\) to denote the spectral norm, and \(\Vert A\Vert _\infty\) to denote the \(\ell _\infty\)-operator norm. For a vector \(v \in \mathbb {R}^n\), we write \({\text {supp}}(v) \subseteq \{1, \dots , n\}\) to denote the support of v, and \(\Vert v\Vert _{\infty } = \max |v_i|\) to denote the maximum absolute entry. We write \(\Vert v\Vert _p\) to denote the \(\ell _p\)-norm, for \(p \ge 1\). We write \({\text {diag}}(v)\) to denote the \(n \times n\) diagonal matrix with entries equal to the components of v. For \(S \subseteq \{1, \dots , n\}\), we write \(v_S\) to denote the |S|-dimensional vector obtained by restricting v to S. We write [n] as shorthand for \(\{1, \dots , n\}\).
2 Problem formulation
We first formalize the data-generating models analyzed in this paper. Suppose we have observation pairs \(\{(x_i, y_i)\}_{i=1}^n\) from the contaminated linear model

\(y_i = x_i^\top \beta ^*+ \gamma ^*_i + \epsilon _i, \quad i \in [n], \qquad (1)\)
where \(\beta ^*\in \mathbb {R}^p\) is the unknown regression vector, \(\gamma ^*\in \mathbb {R}^n\) represents possible contamination in the labels, and the \(\epsilon _i\)’s are i.i.d. sub-Gaussian noise variables with variance parameter \(\sigma ^2\). We also assume the \(x_i\)’s are i.i.d. and \(x_i \perp \!\!\! \perp \epsilon _i\). This constitutes the “first pool.” Note that the vector \(\gamma ^*\) is unknown and may be generated by some adversary. If \(\gamma ^*_i = 0\), the \(i\)th point is uncontaminated and follows the usual linear model; if \(\gamma ^*_i \ne 0\), the \(i\)th point is contaminated/buggy. Let \(T : = {\text {supp}}(\gamma ^*)\) denote the indices of the buggy points, and let \(t := |T|\) denote the number of bugs.
We also assume we have a clean data set which we call the “second pool.” We observe \(\{(\widetilde{x}_i,\widetilde{y}_i)\}_{i=1}^m\) satisfying

\(\widetilde{y}_i = \widetilde{x}_i^\top \beta ^*+ \widetilde{\epsilon }_i, \quad i \in [m], \qquad (2)\)
where the \(\widetilde{\epsilon }_i\)’s are i.i.d. sub-Gaussian noise variables with parameter \(\widetilde{\sigma }^2\). Let \(L := \frac{\sigma }{\widetilde{\sigma }}\), and suppose \(L \ge 1\). Unlike the first pool, the data points in the second pool are all known to be uncontaminated.
For notational convenience, we also use \(X \in \mathbb {R}^{n \times p}\), \(y \in \mathbb {R}^n\), and \(\epsilon \in \mathbb {R}^n\) to denote the matrix/vectors containing the \(x_i\)’s, \(y_i\)’s, and \(\epsilon _i\)’s, respectively. Similarly, we define \(\widetilde{X}\in \mathbb {R}^{m \times p}\), \(\widetilde{y}\in \mathbb {R}^m\), and \(\widetilde{\epsilon }\in \mathbb {R}^m\). Note that \(\beta ^*, \gamma ^*, T, t\), and the noise parameters \(\sigma\) and \(\widetilde{\sigma }\) are all assumed to be unknown to the debugger. In this paper, we will work in settings where \(X^\top X\) is invertible.
Goal: Upon observing \(\{(x_i, y_i)\}_{i=1}^n\), the debugger is allowed to design m points \(\widetilde{X}\) in a stochastic or deterministic manner and query their corresponding labels \(\widetilde{y}\), with the goal of recovering the support of \(\gamma ^*\). We have the following definitions:
Definition 1
An estimator \(\widehat{\gamma }\) satisfies subset support recovery if \({\text {supp}}(\widehat{\gamma }) \subseteq {\text {supp}}(\gamma ^*)\). It satisfies exact support recovery if \({\text {supp}}(\widehat{\gamma }) = {\text {supp}}(\gamma ^*)\).
In words, when \(\widehat{\gamma }\) satisfies subset support recovery, all estimated bugs are true bugs. When \(\widehat{\gamma }\) satisfies exact support recovery, the debugger correctly flags all bugs. We are primarily interested in exact support recovery.
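Definition 1 translates directly into code; the following minimal checks (the function names are ours) are used implicitly when evaluating a debugger:

```python
import numpy as np

def supp(v, tol=1e-8):
    """Indices of v with magnitude above a numerical tolerance."""
    return set(np.flatnonzero(np.abs(v) > tol))

def subset_support_recovery(gamma_hat, gamma_star, tol=1e-8):
    # every estimated bug is a true bug
    return supp(gamma_hat, tol) <= supp(gamma_star, tol)

def exact_support_recovery(gamma_hat, gamma_star, tol=1e-8):
    # the debugger flags all bugs and nothing else
    return supp(gamma_hat, tol) == supp(gamma_star, tol)
```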
Weighted M-estimation Algorithm: We propose to optimize the joint objective
where the weight parameter \(\eta > 0\) determines the relative importance of the two data pools. The objective function applies the usual squared loss to the points in the second pool and introduces the additional variable \(\gamma\) to help identify bugs in the first pool. Furthermore, the \(\ell _1\)-penalty encourages \(\widehat{\gamma }\) to be sparse, since we are working in settings where the number of outliers is relatively small compared to the total number of data points. Note that the objective function (3) may equivalently be formulated as a weighted sum of M-estimators applied to the first and second pools, where the loss for the first pool is the robust Huber loss and the loss for the second pool is the squared loss (cf. Proposition 4 in Appendix A).
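To make the estimator concrete, here is a minimal numpy sketch (the function names and the exact normalization of the objective are ours; read the form below—squared losses on the two pools weighted by \(\eta\), plus an \(\ell_1\) penalty with absorbed tuning parameter `lam`—as representative rather than exact). Since the objective is jointly convex, alternating a least-squares step in \(\beta\) with coordinatewise soft-thresholding in \(\gamma\) converges to a minimizer:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def debug_lasso(X, y, X2, y2, eta, lam, iters=200):
    """Alternating minimization for (up to normalization)
        0.5*||y - X b - g||^2 + 0.5*eta*||y2 - X2 b||^2 + lam*||g||_1.
    For fixed g, the b-step is a weighted least-squares solve;
    for fixed b, the g-step is soft-thresholding of the residuals."""
    n, p = X.shape
    A = X.T @ X + eta * (X2.T @ X2)
    g = np.zeros(n)
    for _ in range(iters):
        b = np.linalg.solve(A, X.T @ (y - g) + eta * (X2.T @ y2))
        g = soft_threshold(y - X @ b, lam)
    return b, g
```

In a noiseless toy instance with two large bugs, the nonzeros of the returned `g` recover the buggy indices.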
Lasso Reformulation: Recall that our main goal is to estimate (the support of) \(\gamma ^*\) rather than \(\beta ^*\). Thus, we will restrict our attention to \(\gamma ^*\) by reformulating the objectives appropriately. We first introduce some notation: Define the stacked vectors/matrices

\(X' = \begin{bmatrix} X \\ \sqrt{\eta }\,\widetilde{X}\end{bmatrix}, \quad y' = \begin{bmatrix} y \\ \sqrt{\eta }\,\widetilde{y}\end{bmatrix}, \quad \epsilon ' = \begin{bmatrix} \epsilon \\ \sqrt{\eta }\,\widetilde{\epsilon }\end{bmatrix}, \qquad (4)\)
where \(X' \in \mathbb {R}^{(m+n) \times p}\) and \(y', \epsilon ' \in \mathbb {R}^{m+n}\). For a matrix A, let \(P_A=A(A^\top A)^{-1}A^\top\) and \(P_A^\perp =I - A(A^\top A)^{-1}A^\top\) denote the projection matrices onto the column space of A and its orthogonal complement, respectively. For a subset \(S \subseteq [n]\), let \(M_S\) denote the \((n+m) \times |S|\) matrix with ith column equal to the canonical vector \(e_{S(i)}\). Thus, right-multiplying by \(M_S\) truncates a matrix to only include columns indexed by S. We have the following useful result:
Proposition 1
The objective function

\(\min _{\gamma \in \mathbb {R}^n} \; \frac{1}{2n}\left\Vert P_{X'}^\perp \left( y' - M_{[n]}\gamma \right) \right\Vert _2^2 + \lambda \Vert \gamma \Vert _1 \qquad (5)\)
shares the same solution for \(\widehat{\gamma }\) with the objective function (3).
Proposition 1, proved in Appendix B, translates the joint optimization problem (3) into an optimization problem only involving the parameter of interest \(\gamma\). We provide a discussion regarding the corresponding solution \(\widehat{\beta }\) in Appendix A for the interested reader. Note that the optimization problem (5) corresponds to linear regression of the vector/matrix pairs \((P_{X'}^\perp y', P_{X'}^\perp M_{[n]})\) with a Lasso penalty, inspiring us to borrow techniques from high-dimensional statistics.
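The operators appearing in (5) are straightforward to realize numerically. The sketch below (helper names are ours) builds \(P_A^\perp\) and \(M_S\); the key algebraic fact behind Proposition 1—that minimizing the squared loss over \(\beta\) leaves the projected residual \(P_{X'}^\perp(\cdot)\)—can be checked against an ordinary least-squares solve:

```python
import numpy as np

def proj_perp(A):
    """P_A^perp = I - A (A^T A)^{-1} A^T, the projector onto the
    orthogonal complement of the column space of A."""
    n = A.shape[0]
    return np.eye(n) - A @ np.linalg.solve(A.T @ A, A.T)

def truncation(n_total, S):
    """M_S: the matrix whose columns are the canonical vectors e_i, i in S."""
    M = np.zeros((n_total, len(S)))
    for j, i in enumerate(S):
        M[i, j] = 1.0
    return M
```

For any vector z, the least-squares residual of regressing z on A equals \(P_A^\perp z\), which is exactly the elimination of \(\beta\) used in the proposition.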
3 Support recovery
The reformulation (5) allows us to analyze the machine learning debugging framework through the lens of Lasso support recovery. The three key conditions we impose to ensure support recovery are provided below. Recall that we use \(M_T\) to represent the truncation matrix indexed by T.
Assumption 1
(Minimum Eigenvalue) Assume that there is a positive number \(b'_{\min }\) such that
Assumption 2
(Mutual Incoherence) Assume that there is a number \(\alpha ' \in [0,1)\) such that
Assumption 3
(Gamma-Min) Assume that
Assumption 1 comes from a primal-dual witness argument (Wainwright 2009) to guarantee that the minimizer \(\widehat{\gamma }\) is unique. Assumption 2 measures a relationship between the sets \(T^c\) and T, indicating that the large number of nonbuggy covariates (i.e., \(T^c\)) cannot exert an overly strong effect on the subset of buggy covariates (Ravikumar et al. 2010). To aid intuition, consider an orthogonal design, where \(X = \begin{bmatrix} c I_{[t], [p]} \\ c' I_{p \times p} \end{bmatrix}\) and \(\widetilde{X}= c'' I_{p \times p}\), for some \(t < p\), and \(c, c', c'' > 0\). We use the notation \(I_{[t], [p]}\) to denote a submatrix of \(I_{p \times p }\) with rows indexed by the set [t]. Suppose the first t points are bugs, and for simplicity, let \(\eta = m/n\). Then the mutual incoherence condition requires \(c < c' + \frac{(c'')^2}{c'}\), meaning that in every direction \(e_i\), the component of buggy data cannot be too large compared to the nonbuggy data and the clean data. Assumption 3 lower-bounds the minimum absolute value of the elements of \(\gamma ^*\). Note that \(\lambda\) is chosen based on \(\epsilon '\), so the right-hand expression is a function of \(\epsilon '\). This assumption indeed captures the intuition that the signal-to-noise ratio, \(\frac{\min _{i\in T}{|\gamma ^*_i|}}{\sigma }\), needs to be sufficiently large.
We now provide two general theorems regarding subset support recovery and exact support recovery.
Theorem 1
(Subset support recovery) Suppose \(P_{X'}^\perp\) satisfies Assumptions 1 and 2. If the tuning parameter satisfies
then the objective (5) has a unique optimal solution \(\widehat{\gamma }\), satisfying \({\text {supp}}(\widehat{\gamma }) \subseteq {\text {supp}}(\gamma ^*)\) and \(\left\Vert \widehat{\gamma }- \gamma ^*\right\Vert _{\infty } \le G'\).
Theorem 2
(Exact support recovery) In addition to the assumptions in Theorem 1, suppose Assumption 3 holds. Then we have a unique optimal solution \(\widehat{\gamma }\), which satisfies exact support recovery.
Note that we additionally need Assumption 3 to guarantee exact support recovery. This follows the aforementioned intuition regarding the assumption. In particular, recall that \(\epsilon\) and \(\widetilde{\epsilon }\) are sub-Gaussian vectors with parameters \(\sigma ^2\) and \(\widetilde{\sigma }^2 = \sigma ^2/L^2\), respectively, where \(L \ge 1\) (i.e., the clean data pool has smaller noise). The minimum signal strength \(\min _{i\in T}{|\gamma ^*_i|}\) needs to be at least \(\varTheta (\sigma \sqrt{\log n})\), since \(\mathbb {E}\left[ \max _{i \in [n]} |\epsilon _i|\right] \le \sigma \sqrt{2 \log (2n)}\). Intuitively, if \(\min _{i\in T}{|\gamma ^*_i|}\) is of constant order, it is difficult for the debugger to distinguish between random noise and intentional contamination.
We now present two special cases to illustrate the theoretical benefits of including a second data pool. Although Theorems 1 and 2 are stated in terms of deterministic design matrices and error vectors \(\epsilon\) and \(\widetilde{\epsilon }\), the assumptions can be shown to hold with high probability in these examples. We provide formal statements of the associated results in Appendix C.2 and Appendix C.3.
Example 1
(Orthogonal design) Suppose Q is an orthogonal matrix with columns \(q_1, q_2, \dots , q_p\), and consider the setting where \(X_T = RQ^\top \in \mathbb {R}^{t \times p}\) and \(X_{T^c} = FQ^\top \in \mathbb {R}^{p \times p}\), where \(R = \left[ {\text {diag}}(\{r_i\}_{i=1}^t) \mid \varvec{0}_{t\times (p-t)}\right]\) and \(F = {\text {diag}}(\{f_i\}_{i=1}^p)\). Thus, points in the contaminated first pool correspond to orthogonal vectors. Similarly, suppose the second pool consists of (rescaled) columns of Q, so \(\widetilde{X}= WQ^\top \in \mathbb {R}^{m \times p}\), where \(W = {\text {diag}}(\{w_i\}_{i=1}^p)\). (To visualize this setting, one can consider \(Q = I\) as a special case.) The mutual incoherence parameter is \(\alpha ' = \max _{1 \le i \le t}\left| \frac{r_i f_i}{f_i^2 + \eta \frac{n}{m}w_i^2}\right|\). Hence, \(\alpha ' \ge 1\) is possible when the weight of a contaminated point dominates the weight of a clean point in some direction, e.g., when \(|r_i| > |f_i|\) and \(w_i = 0\); in contrast, if the second pool includes clean points \(w_iq_i\) with sufficiently large \(|w_i|\), we can guarantee that \(\alpha ' < 1\). Furthermore,
for some constant C. It is not hard to verify that \(G'\) decreases by adding a second pool. Further note that the behavior of the non-buggy subspace, \(\mathrm {span}\{q_{t+1}, \dots , q_p\}\), is not involved in any conditions or conclusions. Thus, our key observation is that the theoretical results for support recovery consistency only rely on the addition of second-pool points in buggy directions.
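The mutual incoherence parameter of Example 1 is cheap to evaluate. The helper below (our naming) implements \(\alpha ' = \max _i \left| \frac{r_i f_i}{f_i^2 + \eta \frac{n}{m} w_i^2}\right|\), so one can see numerically how adding clean weight \(w_i\) in a buggy direction restores \(\alpha ' < 1\):

```python
import numpy as np

def incoherence_orthogonal(r, f, w, eta, n, m):
    """alpha' for the orthogonal design of Example 1:
    alpha' = max_{1<=i<=t} |r_i f_i / (f_i^2 + eta*(n/m)*w_i^2)|,
    where r has length t and f, w have length p."""
    r, f, w = map(np.asarray, (r, f, w))
    t = len(r)
    return np.max(np.abs(r * f[:t] / (f[:t] ** 2 + eta * (n / m) * w[:t] ** 2)))
```

With the simplifying choice \(\eta = m/n\) used in the text, the factor \(\eta \frac{n}{m}\) equals 1, and the failure criterion reduces to \(c \ge c' + (c'')^2/c'\).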
Example 2
(Random design) Consider a random design setting where the rows of X and \(\widetilde{X}\) are drawn from a common sub-Gaussian distribution with covariance \(\varSigma\). The conditions in Assumptions 1–3 are relaxed in the presence of a second data pool when n and m are large compared to p: First, \(b_{\min }'\) increases by adding a second pool. Second, \(\alpha ' \approx \frac{\Vert X_{T^c}\varSigma ^{-1}X_T\Vert _\infty }{n-t+\eta n}\), so the mutual incoherence parameter also decreases by adding a second pool. Third,
where \(X_T\) and \(X_{T^c}\) represent the submatrices of X with rows indexed by T and \(T^c\), respectively. Note that the one-pool case corresponds to \(\eta = 0\) and \(\left\| (I_{t\times t} - \frac{X_T\varSigma ^{-1}X_T^\top }{n+\eta n})^{-1}\right\| _\infty < \left\| (I_{t\times t} - \frac{X_T\varSigma ^{-1}X_T^\top }{n})^{-1}\right\| _\infty\), so if we choose \(\eta \le \frac{mL}{n}\), then \(G'\) decreases by adding a second pool. Therefore, all three assumptions are relaxed by having a second pool, making it easier to achieve exact support recovery.
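The random-design approximation of \(\alpha '\) can likewise be evaluated directly. The sketch below (our naming) uses the displayed approximation \(\alpha ' \approx \Vert X_{T^c}\varSigma ^{-1}X_T^\top \Vert _\infty /(n-t+\eta n)\) (with \(X_T\) transposed so that dimensions match), which makes the monotone improvement in \(\eta\) immediate:

```python
import numpy as np

def alpha_random_design(X, T, Sigma, eta):
    """Approximate mutual incoherence in the random-design setting:
    alpha' ~ ||X_{T^c} Sigma^{-1} X_T^T||_inf / (n - t + eta*n),
    where ||.||_inf is the maximum absolute row sum."""
    n = X.shape[0]
    Tc = np.setdiff1d(np.arange(n), T)
    A = X[Tc] @ np.linalg.solve(Sigma, X[T].T)
    return np.abs(A).sum(axis=1).max() / (n - len(T) + eta * n)
```

The numerator is unchanged by the second pool while the denominator grows with \(\eta\), so \(\alpha '\) can only decrease.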
We also briefly discuss the three assumptions with respect to the weight parameter \(\eta\): Increasing \(\eta\) always relaxes the eigenvalue and mutual incoherence conditions, so placing more weight on the second pool generally helps with subset support recovery. However, the same trend does not necessarily hold for exact recovery. This is because a larger value of \(\eta\) causes the lower bound (9) on \(\lambda\) to increase, resulting in a stricter gamma-min condition. Therefore, there is a tradeoff for selecting \(\eta\).
4 Tuning parameter selection
A drawback of the results in the previous section is that the proper choice of tuning parameter depends on a lower bound (9) which cannot be calculated without knowledge of the unknown parameters \((T, \alpha ', \epsilon ')\). The tuning parameter \(\lambda\) determines how many outliers a debugger detects; if \(\lambda\) is large, then \(\widehat{\gamma }\) contains more zeros and the algorithm detects fewer bugs. A natural question arises: In settings where the conditions for exact support recovery hold, can we select a data-dependent tuning parameter that correctly identifies all bugs? In this section, we propose an algorithm which answers this question in the affirmative.
4.1 Algorithm and theoretical guarantees
Our tuning parameter selection algorithm is summarized in Algorithm 1, which searches through a range of parameter values for \(\lambda\), starting from a large value \(\lambda _u\) and then halving the parameter on each successive step until a stopping criterion is met. The intuition is as follows: First, let \(\lambda ^*\) be the right-hand expression of inequality (9). Suppose that for any value in \(I = [\lambda ^*, 2 \lambda ^*]\), support recovery holds. Then given \(\lambda _u > \lambda ^*\), the geometric sequence \(\varLambda = \left\{ \lambda _u, \frac{\lambda _u}{2}, \frac{\lambda _u}{4}, \dots \right\}\) must contain at least one correct parameter for exact support recovery since \(\varLambda \cap I \ne \emptyset\), guaranteeing that the algorithm stops. As for the stopping criterion, let \(X_S\) denote the submatrix of X with rows indexed by S for \(T^c \subseteq S \subseteq [n]\). We have \(P_{X_S}^\perp \overset{|S| \rightarrow \infty }{\longrightarrow } \left( 1-\frac{p}{|S|}\right) I\) under some mild assumptions on X, in which case \(P_{X_S}^\perp y_S \rightarrow \left( 1-\frac{p}{|S|}\right) (\gamma ^*_S + \epsilon _S)\). When \(\lambda\) is large and the conditions hold for subset support recovery but not exact recovery, we have \(S \cap T \ne \emptyset\), so
In contrast, when \(S = T^c\), we have
When \(\min |\gamma ^*_T|\) is large enough, the task then reduces to choosing a proper threshold to distinguish the error \(|\epsilon _{T^c}|\) from the bug signal \(|\gamma ^*_T|\), which occurs when the threshold is chosen between \(\max _i|\epsilon _i|\) and \(\min _{i\in T}|\gamma ^*_i| - \max _i|\epsilon _i|\).
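The search mechanics of Algorithm 1 can be sketched generically. The loop below halves \(\lambda\) until a supplied stopping criterion accepts the fit; here `fit` and `accept` are placeholders of our own (the precise residual-thresholding rule of Algorithm 1 is not reproduced in this excerpt). If support recovery holds on \([\lambda ^*, 2\lambda ^*]\), the halving sequence intersects that interval, so the loop terminates within \(\log _2(\lambda _u/\lambda ^*)\) steps:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def halving_search(fit, accept, lam_u, max_iters=60):
    """Geometric search over lam_u, lam_u/2, lam_u/4, ...
    Returns the first lambda whose fit passes the stopping criterion."""
    lam = lam_u
    for _ in range(max_iters):
        gamma_hat = fit(lam)
        if accept(gamma_hat, lam):
            return lam, gamma_hat
        lam /= 2.0
    raise RuntimeError("no feasible lambda found")
```

As a toy illustration, soft-thresholding a fixed residual vector with two large entries: the search halves \(\lambda\) until both bugs survive thresholding.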
With the above intuition, we now state our main result concerning exact recovery guarantees for our algorithm. Recall that the \(\epsilon _i\)’s are sub-Gaussian with parameter \(\sigma ^2\).
Let \(c_t := \frac{t}{n} < \frac{1}{2}\) denote the fraction of outliers. We assume knowledge of a constant \({\bar{c}}\) that satisfies \(c_t + {\mathbb {P}}[|\epsilon _i| \le {\bar{c}} \sigma ] < \frac{1}{2}\). Note that a priori knowledge of \({\bar{c}}\) is a less stringent assumption than knowing \(\sigma\), since we can always choose \({\bar{c}}\) to be close to zero. For instance, if we know the \(\epsilon _i\)’s are Gaussian, we can choose \({\bar{c}} < \mathrm {erf}^{-1}(\frac{1}{2} - c_t)\); in practice, we can usually estimate \(c_t\) to be less than \(\frac{1}{3}\), so we can take \({\bar{c}} = \mathrm {erf}^{-1}(\frac{1}{6})\). As shown later, the tradeoff is that having a larger value of \({\bar{c}}\) provides the desired guarantees under weaker requirements on the lower bound of \(\min _{i \in T} |\gamma ^*_i|\). Hence, if we know more about the shape of the error distribution, we can be guaranteed to detect bugs of smaller magnitudes. We will make the following assumption on the design matrix:
Assumption 4
There exists a \(p \times p\) positive definite matrix \(\varSigma\), with bounded minimum and maximum eigenvalues, such that for all \(X^{(k)}\) appearing in the while loop of Algorithm 1, we have
where \(l^{(k)}\) is the number of rows of the matrix \(X^{(k)}\) and c is a universal constant.
This assumption is a type of concentration result, which we will show holds with high probability in some random design settings, via the following proposition:
Proposition 2
Suppose the \(x_i\)’s are i.i.d. and satisfy any of the following additional conditions:
- (a) the \(x_i\)’s are Gaussian and the spectral norm of the covariance matrix is bounded;
- (b) the \(x_i\)’s are sub-Gaussian with mean zero and independent coordinates, and the spectral norm of the covariance matrix is bounded; or
- (c) the \(x_i\)’s satisfy the convex concentration property.
Then Assumption 4 holds with probability at least \(1 - O(n^{-1})\).
The \(\varSigma\) matrix can be chosen as the covariance of X. In fact, Assumption 4 shows that \(P_{X^{(k)}}^\perp\) is approximately a scalar matrix. We now introduce some additional notation: For \(\nu > 0\), define \(c_\nu\) and \(C_\nu\) such that \(\nu = {\mathbb {P}}[|\epsilon _i| \le c_\nu \sigma ]\) and \(\nu = {\mathbb {P}}[|\epsilon _i| \ge C_\nu \sigma ]\). We write \(G'(\lambda )\) to denote the function of \(\lambda\) in the right-hand expression of inequality (8). Proofs of the theoretical results in this section are provided in Appendix D.
Theorem 3
Assume \(\nu\) is a constant satisfying \(\nu + c_t < \frac{1}{2}\). Suppose Assumption 4, the minimum eigenvalue condition, and the mutual incoherence condition hold. If
where C is an absolute constant, and
for some \(c_n \in (0,\frac{1}{2})\), then Algorithm 1 with inputs \({\bar{c}} < c_\nu\) and \(\lambda _u \ge \lambda ^*\) will return a feasible \(\widehat{\lambda }\) in at most \(\log _2\left( \frac{\lambda _u}{\lambda ^*}\right)\) iterations such that the Lasso estimator \(\widehat{\gamma }\) based on \(\widehat{\lambda }\) satisfies \({\text {supp}}(\widehat{\gamma }) = {\text {supp}}(\gamma ^*)\), with probability at least
Theorem 3 guarantees exact support recovery for the output of Algorithm 1 without knowing \(\sigma\). Note that compared to the gamma-min condition (8) with \(\lambda = \lambda ^*\), the required lower bound (12) only differs by a constant factor. In fact, the constant 2 inside \(G'(2\lambda ^*)\) can be replaced by any constant \(c > 1\), but Algorithm 1 will then update \({\hat{\lambda }}^k = {\hat{\lambda }}^{k - 1}/c\) and require \(\log _c \left( \frac{\lambda _u}{\lambda ^*}\right)\) iterations. Further note that larger values of \(c_t\) translate into a larger sample size requirement, as \(n = \varOmega \left( \frac{1}{1-c_t}\right)\) when \(c_n\) is close to 0. A limitation of the theorem is the upper bound on \(\Vert \gamma ^*\Vert _\infty\), which requires t to be smaller than n in a nonlinear fashion. Also, n is required to be \(\varOmega (p^2)\). These two conditions are imposed in our analysis in order to guarantee that \(P_{X_S}^\perp y_S \rightarrow \left( 1-\frac{p}{|S|}\right) (\gamma ^*_S + \epsilon _S)\). We now present a result indicating a practical choice of \(\lambda _u\):
Corollary 1
Define
Suppose Assumption 4, the minimum eigenvalue condition, and the mutual incoherence condition hold. Also assume conditions (11) and (12) hold when replacing \(\lambda ^*\) by \(\lambda (\sigma )\). Taking the input \(\lambda _u = \frac{2\Vert M_{[n]} P^\perp _{X'} y'\Vert _\infty }{n}\), Algorithm 1 outputs a parameter \(\widehat{\lambda }\) in \(O(\log n)\) iterations which provides exact support recovery, with probability at least
Note that \(\lambda _u\) can be calculated using the observed data set. Further note that the algorithm is guaranteed to stop after \(O(\log n)\) iterations, meaning it is sufficient to test a relatively small number of candidate parameters in order to achieve exact recovery.
5 Strategy for second pool design
We now turn to the problem of designing a clean data pool. In the preceding sections, we have discussed how a second data pool can aid exact recovery under sub-Gaussian designs. In practice, however, it is often unreasonable to assume that new points can be drawn from an entirely different distribution. Specifically, recall the movie rating example discussed in Sect. 1: The expert can only rate movies in the movie pool, say \(\{x_i\}_{i=1}^n\), whereas an arbitrarily designed \(\widetilde{x}\), e.g., \(\widetilde{x}= x_1 / 2\), is unlikely to correspond to an existing movie. Thus, we will focus on devising a debugging strategy where the debugger is allowed to choose points for the second pool which have the same covariates as points in the first pool.
In particular, we consider this problem in the “worst” case: suppose a bug generator can generate any \(\gamma ^*\in \Gamma := \{\gamma \in \mathbb {R}^n : |{\text {supp}}(\gamma )| \le t\}\) and add it to the correct labels \(X\beta ^*\). We will also suppose the bug generator knows the debugger’s strategy. The debugger attempts to add a second data pool which will ensure that all bugs are detected regardless of the choice of \(\gamma ^*\). Our theory is limited to the noiseless case, where \(y = X\beta ^*+ \gamma ^*\) and \(\widetilde{y}= \widetilde{X}\beta ^*\); the noisy case is studied empirically in Sect. 6.3.3.
5.1 Preliminary analysis
We denote the debugger’s choice by \(\widetilde{x}_i = X^\top e_{\nu (i)}\), for \(i \in [m]\), where \(e_{\nu (i)} \in \mathbb {R}^{n}\) is a canonical vector and \(\nu : [m] \rightarrow [n]\) is injective. In matrix form, we write \(\widetilde{X}= X_{D}\), where \(D \subseteq [n]\) represents the indices selected by the debugger. Assume \(m < p\), so the debugger cannot simply use the clean pool to obtain a good estimate of \(\beta\). In the noiseless case, we can write the debugging algorithm as follows:
Similar to Proposition 1, given a \(\gamma\), we can pick \(\beta\) to satisfy the constraints, specifically \(\beta = \left( X^\top X + \widetilde{X}^\top \widetilde{X}\right) ^{-1} \left( X^\top (y-\gamma )+ \widetilde{X}^\top \widetilde{y}\right)\). Eliminating \(\beta\), we obtain the optimization problem
Before presenting our results for support recovery, we introduce some definitions. Define the cone set \({\mathbb {C}}(K)\) for some subset \(K \subseteq [n]\) and \(|K| = t\):
Further let \({\mathbb {C}}^A = \cup _{K \subseteq [n], |K| = t} {\mathbb {C}}(K)\), and define
Theorem 4
Suppose
Then a debugger who queries the points indexed by D cannot be beaten by any bug generator who introduces at most t bugs.
Theorem 4 shows that Eq. (16) is a sufficient condition for support recovery, even against an omnipotent bug generator who knows the subset D. For the debugger, the goal is then to find a subset D which makes Eq. (16) true. Whether such a D exists, and how to find it, will be discussed in Sect. 5.2.
Remark 1
When \(m = n\), we can verify that \(Null(\bar{P}(D)) = \{\mathbf {0}\}\), which implies that Eq. (16) always holds. Indeed, in this case, we can simply take \(\widetilde{X}= X\) and solve for \(\beta ^*\) explicitly to recover \(\gamma ^*\).
Remark 2
As stated in Theorem 4, Eq. (16) is a sufficient condition for support recovery. In fact, it is an if-and-only-if condition for signed support recovery: When Eq. (16) holds, \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*)\); and when it does not hold, the bug generator can find a \(\gamma ^*\) with \(|{\text {supp}}(\gamma ^*)| \le t\) such that \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) \ne {{\,\mathrm{sign}\,}}(\gamma ^*)\).
Remark 3
We can also write \(Null(\bar{P}(D))\) as
Let \(\widehat{\beta }= \beta ^*+ v\) for some vector \(v \in \mathbb {R}^p\). From the constraint-based algorithm, we obtain
which implies that \(\widehat{\gamma }_T = \gamma ^*_T - X_Tv\), \(\widehat{\gamma }_{T^c} = -X_{T^c} v\), and \(X_D v = 0\). Let \(u = Xv\). Then we obtain \(\widehat{\gamma }= \gamma ^*- u\). As can be seen, Eq. (16) requires that \(u = \mathbf {0}\), which essentially implies \(\widehat{\gamma }= \gamma ^*\), and thus \({\text {supp}}(\widehat{\gamma }) = {\text {supp}}(\gamma ^*)\).
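In the noiseless case, the constrained estimator analyzed above—minimize \(\Vert \gamma \Vert _1\) subject to \(y = X\beta + \gamma\) and \(\widetilde{y}= \widetilde{X}\beta\)—is a linear program. A sketch using `scipy.optimize.linprog`, splitting \(\gamma\) into positive and negative parts (the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def noiseless_debugger(X, y, Xq, yq):
    """Noiseless debugging as an l1-minimization LP:
        min ||gamma||_1  s.t.  y = X beta + gamma,  yq = Xq beta,
    with variables beta (free) and gamma = gp - gm, gp, gm >= 0."""
    n, p = X.shape
    m = Xq.shape[0]
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = np.block([
        [X, np.eye(n), -np.eye(n)],
        [Xq, np.zeros((m, n)), np.zeros((m, n))],
    ])
    b_eq = np.concatenate([y, yq])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    z = res.x
    return z[:p], z[p:p + n] - z[p + n:]
```

For an easily verifiable test we query \(m = p\) points, which pins down \(\beta\) exactly; the interesting regime \(m < p\) instead relies on the \(\ell _1\) geometry discussed above.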
5.2 Optimal debugger via MILP
The above analysis is also useful in practice for providing a method for designing \(\widetilde{X}\). Consider the following optimization problem:
By Theorem 4 and Remark 3, we immediately conclude that if the problem (17) has the unique solution \((u,v) = (\mathbf {0}, \mathbf {0})\), then a debugger who queries the points indexed by D cannot be beaten by a bug generator who introduces at most t bugs.
Based on this argument, we can construct a bilevel optimization problem for the debugger to solve by further minimizing the objective (17a) with respect to \(D \subseteq [n]\) such that \(|D| \le m\). The optimization problem can then be transformed into a minimax MILP:
Theorem 5
(MILP for debugging) If the optimization problem (18) has the unique solution \((u,v) = (\mathbf {0}, \mathbf {0})\), then the debugger can add m points indexed by \(D = {\text {supp}}(\xi )\) to achieve support recovery.
Remark 4
For more information on efficient algorithms for optimizing minimax MILPs, we refer the reader to Tang et al. (2016), Xu and Wang (2014), and Zeng and An (2014).
6 Experiments
In this section, we empirically validate our Lasso-based debugging method for support recovery. The section is organized as follows:
-
Section 6.1, corresponding to Sect. 3, contains a number of experiments which investigate the performance of our proposed debugging formulation.
-
Section 6.2, corresponding to Sect. 4, studies the proposed tuning parameter selection procedure.
-
Section 6.3 studies the Lasso-based debugging method with a clean data pool, including the proposed MILP algorithm from Sect. 5.
We also compare our proposed method to alternative methods motivated by existing literature.
We begin with an outline of the experimental settings used in most of our experiments:
-
S1
Generate the feature design matrix \(X \in \mathbb {R}^{n \times p}\) by sampling each row i.i.d. from \({\mathcal {N}}(\mathbf {0}_p, I_{p\times p})\).
-
S2
Generate \(\beta ^*\in \mathbb {R}^p\), where each entry \(\beta ^*_i\) is drawn i.i.d. from \(Unif(-1,1)\).
-
S3
Generate \(\epsilon \in \mathbb {R}^n\), where each entry \(\epsilon _i\) is drawn i.i.d. from \({\mathcal {N}}(0,\sigma ^2)\).
-
S4
Generate the bug vector \(\gamma ^*\in \mathbb {R}^n\), where we draw \(\gamma ^*_i = (10\sqrt{\log (2n)}\sigma + Unif(0,10))\cdot Bernoulli(\pm 1, 0.5)\) for \(i \in [t]\) and take \(\gamma ^*_i = 0\) for the remaining positions.
-
S5
Generate the labels by \(y = X\beta ^*+ \epsilon + \gamma ^*\).
These five steps produce a synthetic dataset (X, y); we will specify the particular parameters \((n,p,t, \sigma )\) in each task. If we use a real dataset, the first step changes to [S1’]:
-
S1’
Given the whole data pool \(X_{real}\), uniformly sample n data points from it to construct X.
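For concreteness, steps [S1]–[S5] can be sketched as follows (a minimal sketch; the function name and random-generator handling are ours, not from the paper's code):

```python
import numpy as np

def generate_synthetic(n, p, t, sigma, rng=None):
    """Sketch of steps [S1]-[S5]; the constants follow the text."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))                         # [S1]: rows i.i.d. N(0, I)
    beta_star = rng.uniform(-1.0, 1.0, size=p)              # [S2]: entries Unif(-1, 1)
    eps = rng.normal(0.0, sigma, size=n)                    # [S3]: noise N(0, sigma^2)
    gamma_star = np.zeros(n)                                # [S4]: bugs on first t entries
    signs = rng.choice([-1.0, 1.0], size=t)
    gamma_star[:t] = signs * (10 * np.sqrt(np.log(2 * n)) * sigma
                              + rng.uniform(0.0, 10.0, size=t))
    y = X @ beta_star + eps + gamma_star                    # [S5]: labels
    return X, y, beta_star, gamma_star
```
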
In the plot legends, we will refer to our Lasso-based debugging method as “debugging.” We may also invoke a postprocessing step on top of debugging, called “debugging + postprocess,” which first runs the Lasso optimization algorithm to obtain \(\widehat{\gamma }\) and an estimated support set \({\hat{T}}\), then removes the points \((X_{{\hat{T}},\cdot }, y_{{\hat{T}}})\) and runs ordinary least squares on the remaining points to obtain \(\widehat{\beta }\).
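A minimal sketch of "debugging + postprocess" in the one-pool case, using the projected-Lasso reformulation of the objective (Proposition 1) solved by ISTA; the normalization of \(\lambda\) here is ours and may differ from the paper's convention:

```python
import numpy as np

def debug_postprocess(X, y, lam, n_iter=500):
    """Sketch of 'debugging + postprocess': solve the projected Lasso
        min_gamma 0.5*||P(y - gamma)||_2^2 + lam*||gamma||_1,
    where P projects onto the orthogonal complement of col(X), via ISTA
    (step size 1 is valid since ||P||_2 = 1); then refit OLS on the
    points outside the estimated bug support."""
    n = X.shape[0]
    P = np.eye(n) - X @ np.linalg.pinv(X)     # projector onto col(X)^perp
    gamma = np.zeros(n)
    for _ in range(n_iter):
        z = gamma + P @ (y - gamma)                            # gradient step
        gamma = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # soft-threshold
    T_hat = np.flatnonzero(gamma)
    keep = np.setdiff1d(np.arange(n), T_hat)
    beta_hat, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return gamma, T_hat, beta_hat
```
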
6.1 Support recovery
In this section, we design two experiments. The first experiment investigates the influence of the fraction of bugs \(c_t:=\frac{t}{n}\) on the three assumptions imposed in our theory and the resulting recovery rates; we vary the design of X using different datasets. The second experiment compares debugging with four alternative regression methods, using the precision-recall metric. Note that we take the tuning parameter \(\lambda = \frac{2\sqrt{\log (2(n-t))}}{n}\) for these experiments, since the other outlier detection methods we use for comparison do not come with a procedure for parameter tuning. We explore the performance of the proposed parameter selection algorithm in the next subsection.
6.1.1 Number of bugs versus different measurements
Our first experiment involves four different datasets with different values of n and \(c_t\). We track the quantities involved in the three assumptions (Assumptions 1–3) and the subset/exact recovery rates, which measure the fraction of experiments resulting in subset/exact recovery. The first dataset is generated using the synthetic mechanism described at the beginning of Sect. 6, with \(p=15\). The other three datasets are chosen from the UCI Machine Learning Repository: Combined Cycle Power Plant,Footnote 1 temperature forecast,Footnote 2 and YearPredictionMSD.Footnote 3 They are all associated with regression tasks, with varying feature dimensions (4, 21, and 90, respectively). In the temperature forecast dataset, we remove the station and date attributes from the original dataset, since they are discrete. For each of the UCI datasets, after randomly picking n data points from the entire data pool, we normalize the subsampled dataset according to \(X_{\cdot ,j} = \frac{X_{\cdot ,j} - \frac{1}{n}\sum _{i \in [n]} X_{i,j}}{std[X_{\cdot ,j} ]}\), where std denotes the standard deviation.
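The column normalization above amounts to the usual z-scoring, which can be sketched as:

```python
import numpy as np

def standardize_columns(X):
    """Column-wise normalization used for the UCI datasets: subtract
    each column mean and divide by the column standard deviation."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / sd
```
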
The results are displayed in Fig. 1. For the minimum eigenvalue assumption, a key observation from all datasets is that the minimum eigenvalue becomes larger (improves) as n increases, and becomes smaller as \(c_t\) increases. For the mutual incoherence assumption, the synthetic dataset satisfies the condition with less than 15% outliers. The Combined Cycle Power Plant dataset has mutual incoherence close to 1 when \(c_t\) is approximately 20%-25%, and the mutual incoherence of the YearPredictionMSD dataset approaches 1 when \(c_t\) is approximately 5%. Therefore, we see that the validity of the assumption depends strongly on the design of X. For the gamma-min condition, as \(c_t\) increases, more pronounced outliers (larger \(\min _i|\gamma ^*_i|\)) are required. Finally, with larger n and smaller \(c_t\), the subset/exact recovery rates improve.
6.1.2 Effectiveness for recovery
The second experiment compares our debugging method to other proposed methods in the robust statistics literature. We compare our method with the Fast LTS (Rousseeuw and Van Driessen 2006), E-lasso (Nguyen and Tran 2013), Simplified \(\varTheta\)-IPOD (She and Owen 2011), and Least Squares methods. E-lasso is similar to our formulation, except that it includes an additional penalty on \(\beta\). The Simplified \(\varTheta\)-IPOD method iteratively uses hard thresholding to eliminate the influence of outliers. For the experimental setup, we generate synthetic data with \(n = 2000, t = 200, p = 15\), and \(\sigma = 0.1\), but replace step [S4] by one of the following mechanisms for generating \(\gamma ^*\):
-
1.
We generate \(\gamma ^*_i, i \in T\) by \(Bernoulli(\pm 1,0.5) \cdot (10\sqrt{\log (2n)}\sigma + Unif(0,10))\).
-
2.
We generate \(\beta '\) elementwise from \(Unif(-10,10)\) and take \(\gamma ^*_i = x_i^\top (\beta '-\beta ^*), i \in T\).
The first adversary is random, whereas the second aims to attack the data by inducing the learner to fit a different hyperplane. The precision/recall for Fast LTS and Least Squares is calculated by running each method once and thresholding \(\widehat{\gamma }\) at various levels. For the other three methods, we apply different tuning parameters, compute precision/recall for each result, and finally combine them to plot a macro precision-recall curve.
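The threshold-sweep procedure for producing precision-recall points from a single \(\widehat{\gamma }\) can be sketched as follows (function name and edge-case conventions are ours):

```python
import numpy as np

def precision_recall_curve_from_scores(gamma_hat, true_support, thresholds):
    """For methods outputting a single gamma_hat, sweep a threshold on
    |gamma_hat| and report one (precision, recall) point per threshold."""
    true_set = set(true_support)
    points = []
    for th in thresholds:
        flagged = set(np.flatnonzero(np.abs(gamma_hat) > th).tolist())
        tp = len(flagged & true_set)
        precision = tp / len(flagged) if flagged else 1.0
        recall = tp / len(true_set) if true_set else 1.0
        points.append((precision, recall))
    return points
```
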
In the left panel of Fig. 2, Least Squares and Fast LTS reach perfect AUC, while the other three methods have slightly lower scores. In the right panel of Fig. 2, we see that debugging, E-lasso, and Fast LTS perform comparably well, and slightly better than Simplified \(\varTheta\)-IPOD. Not surprisingly, Least Squares performs somewhat worse, since it is not a robust procedure.
6.2 Tuning parameter selection
We now present two experimental designs for tuning parameter selection. The first experiment runs Algorithm 1 for both one- and two-pool cases. We will present the recovery rates for a range of n’s and \(c_t\)’s, showing the effectiveness of our algorithm in a variety of situations. The second experiment compares Algorithm 1 in one- and two-pool cases to cross-validation, which is a popular alternative for parameter tuning. Our results indicate that Algorithm 1 outperforms cross-validation in terms of support recovery performance.
We begin by describing the method used to generate the second data pool. Given the first data pool (X, y) and the ground-truth parameters \((\beta ^*, \sigma )\), we describe two pipelines to generate the second pool. The first pipeline checks m random points of the first pool, with steps [T1-T3]:
-
T1
Select m points uniformly at random from the first pool to construct \(\widetilde{X}\) for the second pool.
-
T2
Generate \(\widetilde{\epsilon }\in \mathbb {R}^m\), where each entry \(\widetilde{\epsilon }_i\) is drawn i.i.d. from \({\mathcal {N}}(0,\sigma ^2/L)\).
-
T3
Generate the labels by \(\widetilde{y}= \widetilde{X}\beta ^*+ \widetilde{\epsilon }\).
When the debugger is able to query features of clean points from a distribution \({\mathcal {P}}_X\), we can use a second pipeline, where [T1] is replaced by [T1’]:
-
T1’
Independently draw m points from \({\mathcal {P}}_X\) to construct \(\widetilde{X}\).
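Steps [T1]/[T1'] and [T2]–[T3] can be sketched as follows (the `from_distribution` flag and random-generator handling are ours; for [T1'] we take \({\mathcal {P}}_X\) to be standard Gaussian, as in the synthetic experiments):

```python
import numpy as np

def second_pool(X, beta_star, sigma, L, m, from_distribution=False, rng=None):
    """Build a clean pool of m points, either by re-checking random
    first-pool points ([T1]) or by drawing fresh features from P_X ([T1'])."""
    rng = np.random.default_rng(rng)
    if from_distribution:                                    # [T1']
        X_tilde = rng.standard_normal((m, X.shape[1]))
    else:                                                    # [T1]
        idx = rng.choice(X.shape[0], size=m, replace=False)
        X_tilde = X[idx]
    eps_tilde = rng.normal(0.0, sigma / np.sqrt(L), size=m)  # [T2]: variance sigma^2/L
    y_tilde = X_tilde @ beta_star + eps_tilde                # [T3]: clean labels
    return X_tilde, y_tilde
```
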
6.2.1 Verification of Algorithm 1
We use the default procedure for generating the synthetic dataset, with parameters \(p = 15\), \(\sigma =0.1\), and \(t = c_t n\), where \(c_t\) ranges from 0.05 to 0.4 in increments of 0.05. In all cases, we input \({\bar{c}} = 0.2\) and \(\lambda _u = \frac{2\Vert P_{X}^\perp y \Vert _\infty }{n}\) in Algorithm 1.
Figure 3 displays the results for \(n \in \{1,2,3,4,5,10,20,30\}\cdot 10^3\). First, we see that Algorithm 1 achieves exact support recovery in all 20 trials in the yellow area. Second, the exact recovery rate increases with increasing n and decreasing \(c_t\), showing that the algorithm is particularly useful for large-scale data sets. This trend can also be seen from the requirement on n imposed in Theorem 3. In particular, we see that the contour curve for the exact recovery rate matches the curve of \(\left( 1-c_t\right) ^{-\frac{1}{1-2c_n}}\) for some constant \(c_n \in (0,\frac{1}{2})\). However, a downside of Algorithm 1 is that it does not fully take advantage of the second pool in the two-pool case, as the left panel and the right panel display similar results.
6.2.2 Effectiveness of tuning parameter selection
We now compare our method for tuning parameter selection to cross-validation. We also use the postprocessing step described at the beginning of the section. Four measurements are presented, including two recovery rates, the \(\ell _2\)-error of \(\widehat{\beta }\), and the runtime. In both the one- and two-pool cases, we use our default methods for generating synthetic data, and we set \({\bar{c}} = 0.2\) for all the experiments.
The cross-validation method for the one-pool case splits the dataset into training and testing datasets with the ratio of 8:2, then selects \(\lambda\) with the smallest test error, \(\Vert X_{test} \widehat{\beta }- y_{test}\Vert _2\). The procedure for the two-pool case is to run the Lasso-based debugging method with a list of candidate \(\lambda\)’s and test it on the second pool. Finally, we select the \(\lambda\) value with the smallest test error, \(\Vert \widetilde{X}\widehat{\beta }- \widetilde{y}\Vert _2\). We use 15 candidate values for \(\lambda\), spaced evenly on a log scale between \(10^{-6}\) and \(\lambda _u = \frac{2\Vert P_{X}^\perp y \Vert _\infty }{n}\).
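The candidate grid for \(\lambda\) can be generated as follows (a sketch; the function name is ours):

```python
import numpy as np

def lambda_grid(X, y, num=15):
    """15 candidate lambda values, evenly spaced on a log scale between
    1e-6 and lambda_u = 2*||P_X^perp y||_inf / n."""
    n = X.shape[0]
    P = np.eye(n) - X @ np.linalg.pinv(X)       # projector onto col(X)^perp
    lam_u = 2.0 * np.max(np.abs(P @ y)) / n
    return np.logspace(-6, np.log10(lam_u), num=num)
```
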
Figure 4 compares the results in the one-pool case. We note that cross-validation performs poorly on all measurements except \(\Vert \widehat{\beta }- \beta ^*\Vert _2\). Specifically, it does not work at all for subset support recovery, since cross-validation tends to choose very small \(\lambda\) values. For the \(\ell _2\)-error, we see that for small values of \(c_t\), our algorithm can select a suitable choice of \(\lambda\), so that after removing outliers, we can fit the remaining points very well. This is why the debugging + postprocess method gives the lowest error. As \(c_t\) increases, our debugging method shows poorer performance in terms of support recovery, resulting in larger \(\ell _2\)-error for \(\widehat{\beta }\). Although cross-validation seems to perform well, carefully designed adversaries may still destroy the good performance of cross-validation, since its test dataset could be made to contain numerous buggy points.
Figure 5 displays the results for the two-pool experiments, which are qualitatively similar to the results of the one-pool experiments. We emphasize that our method works well for support recovery; furthermore, the methods exhibit comparable performance in terms of the \(\ell _2\)-error. The slightly larger error of our debugging method can be attributed to the bias which arises from using an \(\ell _1\)-norm instead of an \(\ell _0\)-norm.
6.3 Experiments with clean points
We now focus on debugging methods involving a second clean pool. We have three experimental designs: First, we study the influence of m on support recovery. Second, we compare debugging with alternative methods suggested in the literature. Third, we study the performance of our proposed MILP debugger, where we compare it to three other simple strategies. Different strategies for selecting clean points correspond to changing step [T1] in the setup described above.
6.3.1 Number of clean points versus exact recovery
In this subsection, we present two experiments involving the synthetic and YearPredictionMSD datasets, respectively, to see how m affects the exact recovery rate. Recall that the pipeline for generating the first pool is described at the beginning of Sect. 6. For the second pool, we use steps [T1', T2, T3] for the synthetic dataset, where we assume \({\mathcal {P}}_X\) is standard Gaussian. We take steps [T1–T3] for YearPredictionMSD to re-check sampled points from the first pool.
Recall that the YearPredictionMSD dataset is designed to predict the release year of a song from audio features. The dataset consists of 515,345 songs, each with 90 audio features. Therefore, for both experiments, we set \(n = 500, t = 50, p = 90, \sigma = 0.1\), and \(L = 10\), and take \(\lambda = 2.5\frac{\sqrt{\log (n-t)}}{n}\).
From Fig. 6, we see that the phenomena are similar for the two different design matrices. In particular, increasing the number of clean points helps with exact recovery. For instance, in the left subfigure, for \(m = 0\), when \(\min _i |\gamma ^*_i| > 2.9\), the exact recovery rate goes to 1. For \(m = 100\), the exact recovery rate goes to 1 when \(\min _i |\gamma ^*_i| > 2.4\). Also, the slope of the curve for larger m is sharper. Thus, adding a second pool helps relax the gamma-min condition.
6.3.2 Comparisons to methods with clean points
In this experiment, we compare the debugging method for two pools with other methods suggested by the machine learning literature. We generate synthetic data using the default first-pool setup with \(n = 1000, p = 15, t = 100\), and \(\sigma =0.1\), and we run [T1–T3] to generate the second pool using different values of m. For our proposed debugging method, we use Algorithm 1 to select the tuning parameter. We compare the following methods: (1) debugging + postprocessing, (2) least squares, (3) simplified noisy neural network, and (4) semi-supervised eigvec. The least squares solution is applied using \(\left\{ \begin{pmatrix} X \\ \widetilde{X}\end{pmatrix}, \begin{pmatrix} y \\ \widetilde{y}\end{pmatrix}\right\}\).
The simplified noisy neural network method borrows an idea from Veit et al. (2017), which is designed for image classification tasks on datasets with noisy and clean points. That work introduces two networks and combines them: the “Label Cleaning Network,” used to correct the labels, and the “Image Classifier,” which classifies images using CNN features as inputs and corrected labels as outputs. Each is associated with a loss, and the goal is to minimize the sum of the two losses. Let \(w\in \mathbb {R}\), \(\beta _1 \in \mathbb {R}^p\), and \(\beta _2 \in \mathbb {R}^p\) be the variables to be optimized. For our linear regression setting, we design the “Label Cleaning Network” by defining \({\hat{c}}_i = y_i w - x_i^\top \beta _1\) as the corrected labels for both noisy and clean datasets. We then define the loss \({\mathcal {L}}_{clean} = \sum _{i \in cleanset} |{\tilde{y}}_i - {\hat{c}}_i|\). The “Image Classifier” is adapted to the regression setting using predictions \(x_i^\top \beta _2\) and the squared loss, so the classification loss becomes \({\mathcal {L}}_{classify} = \sum _{i \in cleanset} (x_i^\top \beta _2 - {\tilde{y}}_i)^2 + \sum _{i \in noisy set} (x_i^\top \beta _2 - {\hat{c}}_i)^2\). Together, the optimization problem becomes
We perform the optimization by gradient descent, initialized at \(w = 0\) and \(\beta _1 = \beta _2 = \beta _{ls}\). The optimum \(\widehat{\beta }_2\) is used for further predictions, and we then calculate \(\widehat{\gamma }= y - X\widehat{\beta }_2\). For gradient descent, we validate multiple step sizes and choose the one with the best performance in terms of the squared loss on the clean pool.
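A sketch of this gradient-descent procedure, with hand-derived (sub)gradients for the combined loss; the fixed step size and iteration count here are illustrative assumptions, whereas the experiments validate multiple step sizes:

```python
import numpy as np

def noisy_net_fit(X_n, y_n, X_c, y_c, y_tilde, step=1e-4, iters=2000):
    """Sketch of the simplified noisy-network objective.
    Corrected labels: c_i = y_i * w - x_i' beta1.
    Loss = sum_clean |y_tilde_i - c_i| + sum_clean (x_i' beta2 - y_tilde_i)^2
         + sum_noisy (x_i' beta2 - c_i)^2,
    minimized by plain (sub)gradient descent on (w, beta1, beta2)."""
    X_all = np.vstack([X_n, X_c])
    y_all = np.concatenate([y_n, y_tilde])
    beta_ls, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
    w, b1, b2 = 0.0, beta_ls.copy(), beta_ls.copy()   # initialization from the text
    for _ in range(iters):
        c_c = y_c * w - X_c @ b1          # corrected labels, clean pool
        c_n = y_n * w - X_n @ b1          # corrected labels, noisy pool
        s = -np.sign(y_tilde - c_c)       # subgradient of |.| w.r.t. c_c
        r_c = X_c @ b2 - y_tilde
        r_n = X_n @ b2 - c_n
        g_w = s @ y_c + (-2.0 * r_n) @ y_n
        g_b1 = -X_c.T @ s + 2.0 * X_n.T @ r_n
        g_b2 = 2.0 * (X_c.T @ r_c + X_n.T @ r_n)
        w -= step * g_w
        b1 -= step * g_b1
        b2 -= step * g_b2
    return w, b1, b2
```
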
The method “semi-supervised eigvec” is from Fergus et al. (2009), and is designed for the semi-supervised classification problem. It also contains an experimental setting that involves noisy and clean data. To further apply the ideas in our linear regression setting, we make the following modifications: Define the loss function as
where \(L = D-W (\varepsilon )\) is the graph Laplacian matrix and \(\varLambda\) is a diagonal matrix whose diagonal elements are \(\varLambda _{ii} = \lambda\) for clean points and \(\varLambda _{ii} = \frac{\lambda }{c}\) for noisy points. In the classification setting, \(f\in \mathbb {R}^{n+m}\) is to be optimized. The idea is to constrain the elements of f by injecting smoothness/similarity using the Laplacian matrix L. Since we assume the linear regression model, we can further plug in \(f = \begin{pmatrix} X \\ {\tilde{X}}\end{pmatrix}\beta\). Our goal is then to estimate \(\beta\) by minimizing \(J(\beta )\). As suggested in the original paper, we use the range of values \(\varepsilon \in [0,1,1,5], c \in [1,10,50]\), and \(\lambda \in [1,10,100]\). We will evaluate all 36 possible combinations and pick the one with the smallest squared loss on the clean pool.
The results are shown in Fig. 7. We observe that only the debugging method is effective for support recovery, as we have carefully designed our method for this goal. The method from Veit et al. (2017) works best in terms of \(\ell _2\)-error of \(\beta\), especially when m is large. The semi-supervised method, like least squares, does not perform well, possibly because it does not consider replacing/removing the influence of the noisy dataset.
6.3.3 Effectiveness on second pool design
We now provide experiments to investigate the design of the clean pool, corresponding to Sect. 5. We use the Concrete Slump dataset,Footnote 4 where \(p = 7\). We limit our study to small datasets, since the runtime of the MILP optimizer is quite long. We report the performance of the MILP debugging method in both noiseless and noisy settings. In our experiments, we compare the performance of the MILP debugger to a random debugger and a natural debugging method: adding high-leverage points into the second pool. In other words, D.milp selects m clean points to query from running the MILP (18); D.leverage selects the m points with the largest values of \(x_i^\top (X^\top X)^{-1}x_i\); and D.random randomly chooses m points from the first pool without replacement. After choosing the clean pool, the debugger applies the Lasso-based algorithm. In Zhang et al. (2018), all the second pool points are chosen either randomly or artificially. Therefore, we may consider D.random as an implementation of the method in Zhang et al. (2018), which will be compared to our D.milp.
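The leverage scores used by D.leverage are the diagonal entries of the hat matrix, which can be sketched as follows (the function name is ours):

```python
import numpy as np

def leverage_selection(X, m):
    """D.leverage: pick the m points with the largest leverage scores
    h_i = x_i' (X'X)^{-1} x_i (the diagonal of the hat matrix)."""
    H_diag = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
    return np.argsort(H_diag)[::-1][:m]
```

Note that the leverage scores always sum to p when X has full column rank, since they are the diagonal of a rank-p projection matrix.
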
In the noiseless setting, we define \(\beta ^*\) to be the least squares solution computed from all data points, and we randomly select n data points as the \(x_i\)’s. For D.milp and D.leverage, since the bug generator knows their strategies or the selected D, it generates bugs according to the optimization problem (17). Let \(T \subseteq [n]\) be the index set of the t largest values of \(|u_i|\), for \(i = 1,\dots , n\). The bug generator takes \(\gamma ^*_T = u_T\) if the solution u is nonzero, and otherwise randomly generates a subset T of size t and sets \(\gamma ^*_T = \mathbf {1}\). Thus, \(y_i = x_i^\top \beta ^*+ \gamma ^*_i\). For D.onepool, the bug generator follows the above description with \(D = \emptyset\); the orange bars indicate whether the bug generator succeeds (i.e., defeats exact recovery) in the one-pool case. For D.random, the bug generator generates bugs using the same mechanism as for D.onepool. Note that the above bug-generating methods are the “worst” in the sense of signed support recovery. The debuggers run (14) using their selected \(X_D\). From Fig. 8, there is an obvious advantage of D.milp over D.onepool and D.leverage, demonstrating the improved performance of our MILP algorithm. D.random is sometimes successful even when n and t are small, because the bug generator cannot control the randomness, but it performs worse than D.milp overall.
In the noisy setting, we again define \(\beta ^*\) to be the least squares solution computed from the entire data set and randomly select n data points as the \(x_i\)’s. For D.milp and D.leverage, since the bug generator knows their strategies or the selected D, it generates bugs via the optimization problem (17): it takes \(\gamma ^*_T = u_T\) if the solution u is nonzero, where T indexes the t largest elements of |u|, and otherwise randomly generates a subset T of size t and sets \(\gamma ^*_T = \mathbf {1}\). Thus, \(y_i = x_i^\top \beta ^*+ \gamma ^*_i + \epsilon _i\), where \(\epsilon _i \sim {\mathcal {N}}(0,0.01)\). Note that taking \(\gamma ^*_T = u_T\) when the solution u is nonzero precludes correct signed support recovery, as proved in Appendix E.1; this is related to the claim in Remark 2 above. For D.onepool, the bug generator follows the above description with \(D = \emptyset\); the orange bars indicate whether the bug generator succeeds (i.e., defeats exact recovery) in the one-pool case. For D.random, since it is not deterministic, the bug generator does not know D and acts in the same way as in the one-pool case. Note that the above bug-generating methods are the “worst” in the sense of signed support recovery. From Fig. 9, there is an obvious advantage of D.milp over D.onepool and D.leverage. Our theory only guarantees the success of D.milp in the noiseless setting, so the experimental results for the noisy setting are indeed encouraging.
Debugging in practice: The minimax optimization was executed by enumerating all \(\binom{n}{m}\) possible choices of clean points in the outer loop; for each outer iterate, we then run the inner maximization. For optimal debugging in practice, i.e., when n, t, and m are large, recent work provides methods for efficiently solving the minimax MILP (Tang et al. 2016). Note that the MILP debugger can easily be combined with other heuristic methods: one can run the MILP and, if there is a nonzero solution, follow it to add clean points; otherwise, one can switch to other methods, such as choosing random points or high-leverage points.
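The brute-force outer loop can be sketched as follows (the callable `inner_max_value` is a stand-in for the inner maximization, whose details depend on problem (17)):

```python
from itertools import combinations

def enumerate_debugger(n, m, inner_max_value):
    """Try every subset D of size m, run the inner maximization (a callable
    returning the attacker's best objective value for this D), and keep the
    D whose worst case is smallest."""
    best_D, best_val = None, float('inf')
    for D in combinations(range(n), m):
        val = inner_max_value(D)
        if val < best_val:
            best_D, best_val = D, val
    return best_D, best_val
```
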
7 Conclusion
We have developed theoretical results for machine learning debugging via M-estimation and discussed sufficient conditions under which support recovery may be achieved. As shown by our theoretical results and illustrative examples, a clean data pool can assist debugging. We have also designed a tuning parameter selection algorithm which is guaranteed to obtain exact support recovery when the design matrix satisfies a certain concentration property. Finally, we have studied a competitive game between the bug generator and the debugger, and analyzed a mixed integer optimization strategy for the debugger. Empirical results confirm the success of the tuning parameter algorithm and the proposed debugging strategy.
Our work raises many interesting future directions. First, the question of how to optimally choose the weight parameter \(\eta\) remains open. Second, although we have mentioned several efficient algorithms for bilevel mixed integer programming, we have not performed a thorough comparison of these algorithms for our specific problem. Third, although our MILP strategy for second-pool design has been experimentally found to be effective in a noisy setting, we do not have corresponding theoretical guarantees. Fourth, our proposed debugging strategy is a one-shot method, and designing adaptive methods for choosing the second pool constitutes a fascinating research direction. Fifth, the analysis of our tuning parameter algorithm suggests that a geometrically decreasing series might be used as a grid choice for more general tuning parameter selection methods, e.g., cross-validation; in practice, one may not need to test candidate parameters on a large grid chosen linearly from an interval. Lastly, it would be very interesting to extend the ideas in this work to regression or classification settings where the underlying data do not follow a simple linear model.
The supplementary material is organized as follows: Sect. A presents additional discussion of \(\beta \). Sections B, C, D, and E provide proofs for the problem reformulation, support recovery, tuning parameter selection, and the second-pool selection strategy, respectively. They also include additional discussion and formal statements referenced in the main text.
References
Adamczak, R. (2015). A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 1–13.
Cadamuro, G., Gilad-Bachrach, R., & Zhu, X. (2016). Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild.
Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., & Mukhopadhyay, D. (2018). Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069
Fergus, R., Weiss, Y., & Torralba, A. (2009). Semi-supervised learning in gigantic image collections. In NIPS (Vol. 1, p. 2), Citeseer.
Foygel, R., & Mackey, L. (2014). Corrupted sensing: Novel guarantees for separating structured signals. IEEE Transactions on Information Theory, 60(2), 1223–1247.
Henderson, H. V., & Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Review, 23(1), 53–60.
Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding (pp. 409–426), Springer.
Horn, R. A., & Johnson, C. R. (1994). Topics in matrix analysis. Cambridge University Press.
Huber, P., & Ronchetti, E. (2011). Robust statistics. Wiley Series in Probability and Statistics. Wiley.
Hwang, S. G. (2004). Cauchy’s interlace theorem for eigenvalues of Hermitian matrices. The American Mathematical Monthly, 111(2), 157–159.
Meinshausen, N., & Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.
Nguyen, N. H., & Tran, T. D. (2013). Robust Lasso with missing and grossly corrupted observations. IEEE Transactions on Information Theory, 4(59), 2036–2058.
Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using \(\ell _1\)-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319.
Rousseeuw, P. J., & Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1), 29–45.
Sasai, T., & Fujisawa, H. (2020). Robust estimation with Lasso when outputs are adversarially contaminated. arXiv preprint arXiv:2004.05990
Seber, G. A. F. (2008). A matrix handbook for statisticians (Vol. 15). Wiley.
She, Y., & Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494), 626–639.
Slawski, M., & Ben-David, E. (2017). Linear regression with sparsely permuted data. arXiv preprint arXiv:1710.06030
Tang, Y., Richard, J. P. P., & Smith, J. C. (2016). A class of algorithms for mixed-integer bilevel min–max optimization. Journal of Global Optimization, 66(2), 225–262.
Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., & Belongie, S. (2017). Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 839–847).
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science (Vol. 47). Cambridge University Press.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.
Xu, P., & Wang, L. (2014). An exact algorithm for the bilevel mixed integer linear programming problem under three simplifying assumptions. Computers & Operations Research, 41, 309–318.
Zeng, B., & An, Y. (2014). Solving bilevel mixed integer program by reformulations and decomposition. Optimization Online (pp. 1–34).
Zhang, X., Zhu, X., & Wright, S. (2018). Training set debugging using trusted items. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).
Funding
Funding was provided by National Science Foundation: NSF (Grant Numbers DMS-1749857, CCF-1740707).
Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Appendices
A Additional discussions
We present additional discussion here for readers interested in the estimation of \(\beta\).
Debugging connection to \(\beta\). Throughout this paper, we have focused on estimating \(\gamma\) for the purpose of debugging. A result concerning how the second pool can be used to obtain a better estimate of \(\beta\) is as follows:
Proposition 3
Let \(X = USV^\top\) and \(\widetilde{X}= \widetilde{S}V_0^\top\). Let \(m < p\). It holds that
where \(z_{\widehat{\gamma }}\) is the subgradient of \(\Vert \widehat{\gamma }\Vert _1\).
Proof of Proposition 3
Recall the objective function (3) is
By KKT conditions of the objective function,
Plugging \(y = X\beta ^*+ \gamma ^*+ \epsilon\) and \(\widetilde{y}= \widetilde{X}\beta ^*+ \widetilde{\epsilon }\) into (20), we obtain
Multiplying (21b) by \(X^\top\) and plugging the result into (21a), we get
Given that \(\widetilde{X}= \widetilde{S}V_0^\top\),
Plugging into the SVD of \(X = US V^\top\), we have
with probability at least \(1 - \exp (-cm)\). The second step follows because \(\widetilde{\epsilon }\) is sub-Gaussian with parameter \(\sigma ^2/L\). \(\square\)
Note that when \(\widetilde{S}\) is chosen large enough, \(\Vert V_0(\widehat{\beta }- \beta ^*)\Vert _2\) can be made small. Moreover, if the subspace \(V_0\) contains the buggy subspace of \(X_T\), then \(\Vert y_T-y_T^*\Vert _2\) is well controlled and we can spot the contaminated points. This, together with the orthogonal design we will discuss in Sect. C.2, suggests that a successful debugging strategy may be obtained by producing a carefully chosen interaction between the non-buggy subspace (augmented using a second pool of clean data points) and the buggy subspace.
Related work She and Owen (2011). Without the second pool, She and Owen (2011) demonstrated that the solution \(\widehat{\beta }\) of the joint optimization of the objective (3) over \((\beta , \gamma )\) coincides with the optimum of a regression M-estimator in \(\beta\) with the Huber loss. This motivates the question of whether the optimizer \(\widehat{\beta }\) of the objective (3) may similarly be viewed as the optimum of an M-estimation problem.
Proposition 4
The solution \(\widehat{\beta }\) of the joint optimization problem (3) is the unique optimum of the following weighted M-estimation problem:
Proof
Recall the definition of the Huber loss function:
We will show the desired equivalence via the KKT conditions for both objective functions. Taking gradients with respect to \(\beta\) and \(\gamma\) for the original objective function (3), we obtain the following system of equations:
The second equation (25) has a unique solution, given by the soft-thresholding function:
where for scalars \(u, k \in \mathbb {R}\), we have
and \({\text {SoftThresh}}_k\) acts on vectors componentwise. Plugging back into Eq. (24), we obtain
We now consider the KKT conditions for the weighted M-estimator (23). Taking a gradient with respect to \(\beta\), we obtain
The key is to note that
so
from which we may infer the equivalence of Eqs. (26) and (27). This concludes the proof. \(\square\)
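To make the equivalence concrete, the following is a small numerical sketch (with hypothetical dimensions, noise level, and regularization level `lam`, all chosen for illustration only). It alternately minimizes the joint objective over \(\beta\) and \(\gamma\), then checks the key identity \(r - {\text {SoftThresh}}_{n\lambda }(r) = \mathrm{clip}(r, -n\lambda , n\lambda )\) used above, and verifies that \(\widehat{\beta }\) satisfies a Huber-type estimating equation in the clipped residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 0.02
X = rng.normal(size=(n, p))
beta_star = rng.normal(size=p)
y = X @ beta_star + 0.1 * rng.normal(size=n)
y[:5] += 3.0  # a few "buggy" responses

def soft_thresh(u, k):
    # componentwise soft-thresholding operator SoftThresh_k
    return np.sign(u) * np.maximum(np.abs(u) - k, 0.0)

# Alternating minimization of (1/2n)||y - X b - g||^2 + lam ||g||_1
b = np.zeros(p)
g = np.zeros(n)
X_pinv = np.linalg.pinv(X)
for _ in range(500):
    b = X_pinv @ (y - g)                 # beta-step: least squares on corrected y
    g = soft_thresh(y - X @ b, n * lam)  # gamma-step: soft-threshold the residual

r = y - X @ b
# Identity from the proof: r - SoftThresh_k(r) is the clipped (Huber-score) residual.
assert np.allclose(r - soft_thresh(r, n * lam), np.clip(r, -n * lam, n * lam))
# Beta-stationarity becomes the Huber-type estimating equation X^T psi(r) = 0.
assert np.allclose(X.T @ np.clip(r, -n * lam, n * lam), 0, atol=1e-5)
```

The second assertion is exactly the zero-gradient condition (26)/(27) in disguise: eliminating \(\gamma\) by soft-thresholding turns the squared-error term into a Huber-loss gradient in \(\beta\).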
The proposition also illustrates that the objective effectively uses the Huber loss to obtain the robust estimate \(\widehat{\beta }\), from which the estimate \(\widehat{\gamma }\) follows. The estimates of \(\beta\) and \(\gamma\) therefore complement each other. Our reformulation instead relies on a direct analysis of \(\gamma\) and its support.
B Appendix for Sect. 2
In this section, we present the reformulation of the objective function.
Proof of Proposition 1
Using the notation (4), we can translate (3) into
First note that we can split \(y'-X'\beta -\begin{bmatrix}\gamma \\ \mathbf {0}_m\end{bmatrix}\) into two parts by projecting onto the column space of \(X'\) and the perpendicular space:
For any value of \(\widehat{\gamma }\), we can choose \(\widehat{\beta }\) such that \(\left\| P_{X'}\left( y' - X'\widehat{\beta }- \begin{bmatrix}\widehat{\gamma }\\ \mathbf {0}_m\end{bmatrix}\right) \right\| _2^2 = 0\), simply by taking \(\widehat{\beta }= (X'^\top X')^{-1} X'^\top \left( y' - \begin{bmatrix}\widehat{\gamma }\\ \mathbf {0}_m\end{bmatrix}\right)\). Hence, we get
and (28) becomes
Therefore, the two optimization problems share the same solution for \(\widehat{\gamma }\). \(\square\)
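The projection argument above can be checked numerically. The sketch below (with hypothetical dimensions and an arbitrary candidate \(\gamma\)) verifies that the minimizing \(\widehat{\beta }\) zeroes the component of the residual inside \(\mathrm{col}(X')\), so the \(\beta\)-minimized objective depends on \(\gamma\) only through the orthogonal projection \(P_{X'}^\perp\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 20, 5, 3
Xp = rng.normal(size=(n + m, p))       # stacked design X' (contaminated + clean rows)
yp = rng.normal(size=n + m)            # stacked response y'
gam = rng.normal(size=n)               # any candidate gamma, padded with 0_m
g_pad = np.concatenate([gam, np.zeros(m)])

P = Xp @ np.linalg.inv(Xp.T @ Xp) @ Xp.T   # projection onto col(X')
P_perp = np.eye(n + m) - P

# beta that zeroes the component of the residual inside col(X'):
beta_hat = np.linalg.inv(Xp.T @ Xp) @ Xp.T @ (yp - g_pad)
assert np.allclose(P @ (yp - Xp @ beta_hat - g_pad), 0)

# hence the beta-minimized squared error depends on gamma only through P_perp:
lhs = np.sum((yp - Xp @ beta_hat - g_pad) ** 2)
rhs = np.sum((P_perp @ (yp - g_pad)) ** 2)
assert np.allclose(lhs, rhs)
```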
C Appendix for Sect. 3
Notations in appendix: We write \(P_{X',TT}^\perp\) to represent the submatrix of \(P_{X'}^\perp\) with rows and column indexed by T. We write \(P_{X',T\cdot }^\perp\) to represent the submatrix of \(P_{X'}^\perp\) with rows indexed by T and \(P_{X',\cdot T}^\perp\) to represent the submatrix of \(P_{X'}^\perp\) with columns indexed by T. For simplicity, let \(\bar{P}= P_{X'}^\perp M_{[n]}\). We slightly abuse notation by using \(\bar{P}_{T}\) and \(\bar{P}_{T^c}\) to denote \(\bar{P}_{\cdot T}\) and \(\bar{P}_{\cdot T^c}\), respectively.
In this appendix, we provide proofs and additional details for the results in Sect. 3. The proofs for fixed design are in Sect. C.1. We discuss orthogonal design in Sect. C.2 and sub-Gaussian design in Sect. C.3. In particular, we use the two special designs to better understand the three assumptions and to see how having a clean pool helps with support recovery. We refer to the setting with only the contaminated pool as the one-pool case, and to the setting with both data pools as the two-pool case.
1.1 C.1 Proofs of Theorem 1 and Theorem 2
Proof of Theorem 1
We follow the usual Primal Dual Witness argument for support recovery in linear regression, which contains the following steps (Wainwright 2009):
1.
Set \(\widehat{\gamma }_{T^c} = 0\).
2.
Solve the oracle subproblem for \((\widehat{\gamma }_T, {\hat{z}}_{T})\):
$$\begin{aligned} \widehat{\gamma }_T \in \arg \min _{\gamma \in \mathbb {R}^t} \left\{ \frac{1}{2n}\left\Vert Ay' - B\gamma \right\Vert _2^2 + \lambda \left\Vert \gamma \right\Vert _1\right\} , \end{aligned}$$(29) and choose \({\hat{z}}_T \in \partial {\left\Vert \widehat{\gamma }_T\right\Vert _1}\). In the one data pool case, we have \(A = P_{X}^{\perp }\) and \(B = P_{X,\cdot T}^{\perp }\); in the two data pool case, we have \(A = P_{X'}^\perp\) and \(B = \bar{P}_T\).
3.
Solve \({\hat{z}}_{T^c}\) via the zero-subgradient equation, and check whether the strict dual feasibility condition holds: \(\left\Vert \widehat{z}_{T^c}\right\Vert _{\infty } < 1\).
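The three steps of the primal-dual witness construction can be sketched numerically. The snippet below (a hypothetical one-pool instance with strong corruptions; an ISTA iteration stands in for an exact Lasso solver) solves the reduced Lasso over \(\gamma\), forms the dual vector from the KKT conditions, and checks strict dual feasibility on \(T^c\) together with exact support recovery.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, t = 100, 5, 3
lam = 0.02
X = rng.normal(size=(n, p))
T = np.arange(t)                        # true buggy indices
gam_star = np.zeros(n)
gam_star[T] = 5.0                       # strong corruptions
y = X @ rng.normal(size=p) + gam_star + 0.01 * rng.normal(size=n)

P_perp = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

def soft(u, k):
    return np.sign(u) * np.maximum(np.abs(u) - k, 0.0)

# ISTA on the reduced Lasso (1/2n)||P_perp (y - g)||^2 + lam ||g||_1
# (step size n, since the Lipschitz constant of the smooth part is 1/n)
g = np.zeros(n)
for _ in range(3000):
    g = soft(g + P_perp @ (y - g), n * lam)

# dual witness z = P_perp (y - g) / (n lam); strict feasibility off the support
z = P_perp @ (y - g) / (n * lam)
support = np.flatnonzero(np.abs(g) > 1e-8)
assert set(support) == set(T)           # exact support recovery
assert np.max(np.abs(z[t:])) < 1.0      # strict dual feasibility on T^c
```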
As in the usual Lasso analysis (Wainwright 2009), under the eigenvalue condition (6), \((\widehat{\gamma }_T, 0) \in \mathbb {R}^n\) is the unique optimal solution of the Lasso, where \(\widehat{\gamma }_T\) is the solution obtained by solving the oracle subproblem (29).
The focus of our current analysis is to verify the conditions under which the strict dual feasibility condition holds. The KKT conditions for Eq. (5) may be rewritten as
where \({\hat{z}}_T \in \partial \left\Vert \widehat{\gamma }_T\right\Vert _1, {\hat{z}}_{T^c} \in \partial \left\Vert \widehat{\gamma }_{T^c}\right\Vert _1\).
We will use the following equations to simplify terms later:
Since \(\bar{P}_T^\top \bar{P}_T\) is invertible by condition (6), we can multiply Eq. (30) by \(\left( \bar{P}_T^\top \bar{P}_T\right) ^{-1}\) on the left to obtain
Plugging this into Eq. (31), we then obtain
or
We need to show that \(\Vert {\hat{z}}_{T^c}\Vert _\infty < 1\).
Note that condition (7) gives us
Furthermore, since
we have
Combining these inequalities, we obtain strict dual feasibility:
In addition, applying the triangle inequality to the RHS of Eq. (32), we obtain
This concludes the proof. \(\square\)
Proof of Theorem 2
Note that
where the last inequality uses Theorem 1. Thus, if condition (8) also holds, we have
concluding the proof. \(\square\)
1.2 C.2 Orthogonal design
1.2.1 C.2.1 Main results for orthogonal design
In this section, we focus on a special case in which the data have an orthogonal structure. Let \(X = \begin{bmatrix} RQ^\top \\ FQ^\top \end{bmatrix} \in \mathbb {R}^{(t+p) \times p}\) and \(\widetilde{X}= WQ^\top \in \mathbb {R}^{p \times p}\), where Q is an orthogonal matrix with columns \(q_1, q_2, \dots , q_p\); \(F\) and \(W\) are diagonal matrices with diagonal entries \(f_i\) and \(w_i\) (\(i \in [p]\)), respectively; and \(R = \begin{bmatrix} {\text {diag}}(r_1, \dots , r_t)&\big |&0_{t\times (p-t)} \end{bmatrix} \in \mathbb {R}^{t \times p}.\) We assume \(r_i \ne 0\) for all \(i \in [t]\) and \(f_i \ne 0\) for all \(i \in [p]\). Suppose the first t points are buggy and the remaining p points are non-buggy, i.e., \(X_T = RQ^\top \in \mathbb {R}^{t \times p}\) and \(X_{T^c} = FQ^\top \in \mathbb {R}^{p \times p}\).
Applying Theorems 1 and 2, we obtain Propositions 5 and 6.
Proposition 5
In the one-pool case, suppose we choose
for some constant \(C > 0\), and
Then the contaminated pool is capable of achieving subset support recovery with probability at least \(1 - e^{-\frac{C^2}{2}}\).
In the two-pool case, suppose we choose
for some constant \(C' > 0\), and
Then adding clean points will achieve subset support recovery with probability at least \(1 - e^{-\frac{C'^2}{2}}\).
As stated in Theorems 1 and 2, to ensure exact recovery, we also need to impose a gamma-min condition. This leads to the following proposition:
Proposition 6
In the one-pool case, suppose inequality (35) holds. If also
then there exists a \(\lambda\) to achieve exact recovery, with probability at least \(1 - 2e^{-\frac{c^2}{2}} - e^{-\frac{C^2}{2}}\).
In the two-pool case, suppose \(\eta \le \frac{mL}{n}\), and inequality (37) holds. If also
then there exists a \(\lambda\) to achieve exact recovery, with probability at least \(1 - 2e^{-\frac{c^2}{2}} - e^{-\frac{C^2}{2}}\).
Comparing (35) and (37), the mutual incoherence parameter decreases from \(\frac{r_i^2}{f_i^2}\) to \(\frac{r_i^2}{f_i^2 + \frac{\eta n}{m}w_i^2}\). Comparing (38) and (39), the second \(\max\) term satisfies \(\underset{1 \le i \le t}{\max }\frac{r_i^2}{f_i^2} \ge \underset{1 \le i \le t}{\max }\frac{r_i^2(Lf_i^2+\frac{\eta n}{m}w_i^2)}{L(f_i^2 + \frac{\eta n}{m}w_i^2)^2}\), because
when \(L \ge 1\). Also note that \(\frac{1}{1 - \alpha } > \frac{1}{1-\alpha '}\). Altogether, the requirement on \(\min _{i \in [t]}|\gamma ^*_i|\) is weakened by introducing clean points, and the mutual incoherence condition improves in the two-pool setting. The gamma-min condition imposes a lower bound of \(\varOmega \left( \sqrt{\log (n-t)}\right)\) on the signal-to-noise ratio \(\frac{\min _{i \in [t]} |\gamma ^*_i|}{\sigma }\), and including the second pool reduces the prefactor.
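This improvement can be checked numerically under the orthogonal design just described (hypothetical values for \(r_i\), \(f_i\), \(w_i\) and for the weight ratio \(\eta n/m\)): adding clean points in the buggy directions shrinks the incoherence-type quantity \(\Vert P^\perp _{T^c T}(P^\perp _{TT})^{-1}\Vert _\infty\).

```python
import numpy as np

rng = np.random.default_rng(3)
p, t = 6, 2
eta_n_over_m = 4.0                              # hypothetical value of eta*n/m
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))    # orthogonal directions q_1..q_p
r = np.array([3.0, 2.0])                        # buggy-row scales
f = np.linspace(1.0, 2.0, p)                    # clean-row scales
w = 2.0 * np.ones(p)                            # trusted-pool scales

R = np.zeros((t, p))
R[np.arange(t), np.arange(t)] = r
X_T = R @ Q.T                                   # buggy rows, directions q_1, q_2
X_Tc = np.diag(f) @ Q.T                         # non-buggy rows of contaminated pool
X_tilde = np.diag(w) @ Q.T                      # second (trusted) pool

def incoherence(blocks):
    """Max row sum of |P_perp_{T^c,T} (P_perp_{T,T})^{-1}| for the stacked design."""
    A = np.vstack(blocks)
    P_perp = np.eye(len(A)) - A @ np.linalg.inv(A.T @ A) @ A.T
    Ptt = P_perp[:t, :t]
    Pct = P_perp[t:t + p, :t]
    return np.max(np.abs(Pct @ np.linalg.inv(Ptt)).sum(axis=1))

one_pool = incoherence([X_T, X_Tc])
two_pool = incoherence([X_T, X_Tc, np.sqrt(eta_n_over_m) * X_tilde])
assert two_pool < one_pool    # clean points weaken the incoherence requirement
```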
As can be seen, we want \(|w_i|\) to be sufficiently large compared to \(|f_i|\). However, if \(|w_i|\) is bounded, we may instead ensure support recovery by repeating points. We now discuss the effect of repeating points and determine the number of points needed to guarantee correct support recovery. Suppose
where \(\mathbf {w_i} = [w_{i1}, \dots , w_{il_i}]^\top\). For the ith direction \(q_i\), we have \(l_i\) repeated points with respective weights \(w_{i1}, w_{i2}, \dots , w_{il_i}\).
Proposition 7
Suppose the scale of clean data points is bounded by \(w_B\). Using \(w_{i1}, \dots , w_{il_i}\), where \(l_i = \left\lceil \left( {\frac{|w_i|}{w_B}}\right) ^2\right\rceil\) and \(|w_{ij}| = w_B,\ \forall j \in [l_i]\), achieves the same effect on Conditions 1, 2, and 3 as adding a single point with scale \(w_i\).
From Proposition 7, we see that to correctly identify the bugs, we can also query multiple points in the same direction if the leverage of a single additional point is not large enough.
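The counting in Proposition 7 is elementary and can be sketched directly (with hypothetical values for the desired scale \(w_i\) and the budget \(w_B\)): since the three conditions depend on the clean points only through \(\sum _j w_{ij}^2\), taking \(l_i = \lceil (|w_i|/w_B)^2\rceil\) points of magnitude \(w_B\) matches a single point of scale \(w_i\).

```python
import math

w_i, w_B = 7.5, 2.0                        # hypothetical target scale and per-point bound
l_i = math.ceil((w_i / w_B) ** 2)          # number of repeated clean points needed
total = l_i * w_B ** 2                     # sum_j w_ij^2 with |w_ij| = w_B
assert total >= w_i ** 2                   # matches (or exceeds) one point of scale w_i
assert (l_i - 1) * w_B ** 2 < w_i ** 2     # and l_i is the smallest such count here
```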
1.2.2 C.2.2 Proofs for orthogonal design
In this section, we first simplify the three conditions, and then provide the proofs of Propositions 5, 6, and 7.
In the one-pool case, we have
Note that \(P_{X,TT}^\perp\) is a diagonal matrix. Thus, the eigenvalues are immediately obtained and
The condition that \(P_{X, TT}^\perp\) is invertible is therefore equivalent to the condition that \(f_i \ne 0\) for all i. Assuming this is true, we have
The mutual incoherence condition can then be written in terms of the quantity
Note that the mutual incoherence condition also implies that \(f_i \ne 0,\ \forall i\), since the mutual incoherence parameter will otherwise go to infinity.
The remaining condition is the gamma-min condition. Note that the upper bound on the \(\ell _\infty\)-error of \(\gamma\) consists of two parts:
Regarding \(P_{X, T\cdot }^\perp\) as two blocks, \(\left( P_{X, TT}^\perp ,\ P_{X, TT^c}^\perp \right)\), we have
Altogether, we see that
To summarize, the minimum eigenvalue condition becomes
the mutual incoherence condition becomes
and the gamma-min condition becomes
Similar calculations show that in the two-pool case, the minimum eigenvalue condition becomes
the mutual incoherence condition becomes
and the gamma-min condition becomes
where
Here is the proof of Proposition 5.
Proof of Proposition 5
According to Theorem 1, the subset support recovery result relies on two conditions: the minimum eigenvalue condition and the mutual incoherence condition. In the orthogonal design case, we will argue that both inequalities (40a) and (40b) hold in the one-pool case, and that inequality (37) is sufficient for both inequalities (41a) and (41b) in the two-pool case.
For the one-pool case, the assumption (35) implies that \(f_i \ne 0\) for all \(i \in [t]\). Note that the minimum eigenvalue condition (40a) is equivalent to \(f_i \ne 0\) for all \(i \in [t]\); hence, the minimum eigenvalue condition holds. Furthermore, the mutual incoherence condition (40b) clearly holds.
For the two-pool case, if \(f_i = 0\) for some \(i \in [t]\), then plugging into (37) implies that \(w_i^2 > 0\). Thus, \(f_i\) and \(w_i\) cannot be zero at the same time, implying that the eigenvalue condition (41a) holds. Note that inequality (37) is equivalent to inequality (41b).
The remainder of the argument concerns the choice of \(\lambda\). Note that Theorem 1 requires \(\lambda\) to be lower-bounded for subset recovery (see inequality (9)). Taking the two-pool case as an example, we will show that when inequality (36) holds, inequality (9) holds with high probability. Define
Note that \(\left\Vert \bar{P}_{\cdot j}^\top \Big (I - \bar{P}_T(\bar{P}_T^\top \bar{P}_T)^{-1}\bar{P}_T^\top \Big )\right\Vert _2 \le 1\) for all \(j \in T^c,\) and \(\epsilon ' = \begin{pmatrix} \epsilon \\ \sqrt{\frac{\eta n}{m}}\widetilde{\epsilon }\end{pmatrix}\) has i.i.d. sub-Gaussian entries with parameter at most \(\max \{1,\frac{\eta n}{mL}\}\sigma ^2\). Thus, \(Z_j\) is sub-Gaussian with parameter at most \(\max \{1,\frac{\eta n}{mL}\}\frac{\sigma ^2}{n^2}\). By a sub-Gaussian tail bound (cf. Lemma 1), we then have
Let \(C'\) be a constant such that
and define
Note that we want
which therefore occurs with probability at least \(1 - e^{-\frac{C'^2}{2}}\) when
The proof for the one-pool case is similar, so we omit the details. \(\square\)
Here is the proof of Proposition 6.
Proof of Proposition 6
To simplify notation, define
Note that \(u_i\) is \(\sigma _{u_i}\)-sub-Gaussian and \(v_i\) is \(\sigma _{v_i}\)-sub-Gaussian, with variance parameters
We now prove two technical lemmas:
Lemma 1
(Concentration for non-identical sub-Gaussian random variables) Suppose \(\{u_i\}_{i=1}^t\) are \(\sigma _{u_i}\)-sub-Gaussian random variables and \(\{v_i\}_{i=1}^t\) are \(\sigma _{v_i}\)-sub-Gaussian random variables. Then the following inequalities hold:
Proof
Note that
where \(u_{t+i} :=-u_i\), for \(1 \le i \le t\). By a union bound, we have
For each \(u_i\), we have the tail bounds
Altogether, we see that
Similarly, we may obtain the desired concentration inequality for the \(v_i\)’s:
\(\square\)
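A quick Monte Carlo illustration of Lemma 1 (Gaussian special case, hypothetical parameters): the union-bound tail \(2t\exp \big (-\delta ^2/(2\max _i \sigma _{u_i}^2)\big )\) dominates the empirical probability that \(\max _i |u_i|\) exceeds \(\delta\), even when the \(\sigma _{u_i}\) differ.

```python
import numpy as np

rng = np.random.default_rng(4)
t, trials = 5, 20000
sig = np.linspace(0.5, 1.5, t)              # distinct sub-Gaussian parameters
delta = 4.0
u = rng.normal(size=(trials, t)) * sig      # u_i ~ N(0, sig_i^2), independent
emp = np.mean(np.max(np.abs(u), axis=1) >= delta)
bound = 2 * t * np.exp(-delta ** 2 / (2 * np.max(sig) ** 2))
assert emp <= bound                          # union-bound tail dominates
```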
Lemma 2
In the one-pool case, under the orthogonal design setting, suppose
where \(\sigma _{u_i} = \sqrt{1+\frac{r_i^2}{f_i^2}}\sigma\). Then the gamma-min condition holds with probability at least \(1-2e^{-c_1^2/2}\).
In the two-pool case, suppose
where \(\sigma _{v_i} = \sqrt{1+\frac{r_i^2(L^2f_i^2+\frac{\eta n}{m}w_i^2)}{L^2(f_i^2 + \frac{\eta n}{m}w_i^2)^2}}\sigma\). Then the gamma-min condition holds with probability at least \(1-2e^{-c_2^2/2}\).
We use inequality (42) in Lemma 1. Let \(\delta _1 = \sqrt{2\log t + c_1^2} \max _{1 \le i \le t}\sigma _{u_i}\) where \(c_1 \in (0,+\infty )\). Then with probability \(1 - 2e^{-\frac{c_1^2}{2}}\), the following holds:
In inequality (43), take \(\delta _2 = \sqrt{2\log t + c_2^2} \max _{1 \le i \le t}\sigma _{v_i}\), where \(c_2 \in (0, +\infty )\). Then with probability \(1 - 2e^{-\frac{c_2^2}{2}}\), the following holds:
Combining these inequalities with conditions (40c) and (41c), we obtain \(G \le \min _{i \in [t]}|\gamma ^*_i|\) with probability at least \(1 - 2e^{-\frac{c_1^2}{2}}\) in the one-pool case and at least \(1 - 2e^{-\frac{c_2^2}{2}}\) in the two-pool case. Specifically, choosing \(c_1 = c_2 = 2.72\) yields a probability guarantee of at least \(95\%\) for both statements.
Therefore, Proposition 6 is proved by plugging the results from Lemma 1 into Lemma 2. \(\square\)
Here is the proof of Proposition 7.
Proof of Proposition 7
We will prove the proposition by comparing the three conditions in the two situations: adding one clean point and repeating multiple clean points. The conditions for adding one clean point are already provided in inequalities (41a), (41b) and (41c) above.
We now provide the conditions for repeating multiple clean points. The minimum eigenvalue condition becomes
the mutual incoherence condition becomes
and the gamma-min condition becomes
Compared with inequalities (41a), (41b) and (41c), conditions (46a), (46b) and (46c) replace \(w_i^2\) by \(\sum _{j = 1}^{l_i}w_{ij}^2\). Suppose the scale of the clean data points is bounded by \(w_B\). Then adding one data point may not be enough to satisfy the three conditions. Thus, to achieve the same effect as a single point of large scale \(|w_i|\) in inequalities (41a), (41b) and (41c), we need the number of repeated clean points to be at least \(\left( \frac{|w_i|}{w_B}\right) ^2\). \(\square\)
1.3 C.3 Sub-Gaussian design
In this section, we will present the support recovery results for sub-Gaussian design in Proposition 8 and Proposition 9, and the comparisons of the three conditions in the one- and two-pool cases in Table 1. Later, we will provide the proofs of the propositions.
1.3.1 C.3.1 Main results for sub-Gaussian design
Proposition 8
Suppose \(\{x_j\}_{j \in T^c}\) and \(\{\widetilde{x}_i\}_{i \in [m]}\), are i.i.d. sub-Gaussian with parameter \(\sigma ^2_x\) and covariance matrix \(\varSigma \succ 0\). Further assume that \(\Vert X_T\Vert _2 \le B_T\). For the one-pool case, suppose we choose \(\lambda\) to satisfy inequality (34) and the sample size satisfies
then the contaminated pool achieves subset support recovery with probability at least \(1 - e^{-\frac{C^2}{2}} - 2e^{-C_1} - n^{-(c_2-1 )}\).
For the two-pool case, assume we choose \(\lambda\) to satisfy (36) and the sample sizes satisfy
and
Then adding clean points achieves subset support recovery with probability at least \(1 - e^{-\frac{C'^2}{2}} - 2e^{-C_1'} - n^{-(c_2-1)}\).
As seen in Proposition 8, the required number of data points n may be reduced by a factor of \(1+\eta\) with the introduction of a second data pool.
Note that when T is randomly chosen from [n], we have \(B_T = O(\sqrt{t}\Vert \varSigma \Vert _2)\), so inequalities (47) and (48) require \(\frac{t}{n}\) to be upper-bounded, and adding a second pool relaxes this upper bound by a factor of \(1+\eta\) relative to the one-pool case.
We now present a result concerning exact support recovery:
Proposition 9
In the one-pool case, suppose inequality (47) holds. If
then there exists a \(\lambda\) to achieve exact recovery with probability at least \(1 - 2e^{-c} - e^{-\frac{C^2}{2}} - 2e^{-C_1} - n^{-C_2}\).
For the two-pool case, suppose the assumptions in Proposition 8 hold, and
Then there exists a \(\lambda\) to achieve exact recovery with probability at least \(1 - 2e^{-c} - e^{\frac{-C'^2}{2}} - 2e^{-C_1'} - n^{-C_2}\).
Compared to Proposition 8, Proposition 9 additionally requires the “signal-to-noise” ratio to be large enough. We can show that \(b_{\min } \le b'_{\min }\); thus, for an appropriate choice of \(\eta\), the lower bound (50) is smaller than the bound (49), so the gamma-min condition is improved.
We now briefly compare the three conditions for the one- and two-pool cases in the random design setting.
In general, the eigenvalue condition is improved by adding a second pool. The mutual incoherence condition is improved in the two-pool case with large m by a constant multiplier \(\frac{1}{1+ \eta \frac{n}{m}}\ (\le 1)\), and the gamma-min condition lower bound is improved by a constant \(\frac{b_{\min }}{b_{\min }'}\ (\le 1)\).
For the eigenvalue condition, the key result is that adding clean data points will not hurt, i.e., it makes the minimum eigenvalue smaller. A formal statement is provided in Proposition 10. Recall that
where \(X' = \left( \begin{array}{c} X \\ \sqrt{\frac{\eta n}{m}} \widetilde{X}\end{array}\right)\), and we assume that \(X^\top X\) is invertible.
Proposition 10
(Comparison of minimum eigenvalue conditions) We have
Note that the result of Proposition 10 does not require any assumptions on \(\widetilde{X}\) or \(\eta\). However, the degree of improvement depends on \(\eta\), as seen in the proof. Typically, when n is small, increasing \(\eta\) leads to a large increase in the minimum eigenvalue; when n is large, increasing \(\eta\) changes the minimum eigenvalue very little. A typical relationship between \(\eta\) and \(\lambda _{\min }\left( P^\perp _{X', TT}\right)\) can be seen in Fig. 10.
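The monotone improvement in Proposition 10 can be checked numerically (hypothetical dimensions and designs): as \(\eta\) grows, the middle inverse shrinks in the Loewner order, so \(\lambda _{\min }\left( P^\perp _{X', TT}\right)\) never decreases.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, p, t = 30, 10, 4, 3
X = rng.normal(size=(n, p))           # contaminated pool
Xt = rng.normal(size=(m, p))          # trusted pool

def lam_min_tt(eta):
    # minimum eigenvalue of the TT block of P_perp for the eta-weighted stack
    A = np.vstack([X, np.sqrt(eta * n / m) * Xt]) if eta > 0 else X
    P_perp = np.eye(len(A)) - A @ np.linalg.inv(A.T @ A) @ A.T
    return np.linalg.eigvalsh(P_perp[:t, :t]).min()

vals = [lam_min_tt(eta) for eta in [0.0, 0.5, 1.0, 2.0, 8.0]]
assert all(b >= a - 1e-10 for a, b in zip(vals, vals[1:]))   # non-decreasing in eta
```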
For the mutual incoherence condition, it is possible to find settings with small m that make the condition worse. Consider the following example:
Example 3
(Example where the mutual incoherence condition worsens) Suppose
Then
Despite this negative example, we can show that including a second pool helps when m is large compared to p. Recalling the assumption that \(X_{T^c}^\top X_{T^c}\) is invertible, we can write
The first equality uses the definitions of \(P_{X, T^cT}^\perp\) and \(P_{X,TT}^\perp\), the second equality uses the Woodbury matrix identity (Henderson and Searle 1981), and the third equality follows from simple linear algebraic manipulations.
Similarly, we can simplify the mutual incoherence condition for the two-pool case, by replacing \(X_{T^c}^\top X_{T^c}\) with \(X_{T^c}^\top X_{T^c} + \eta \frac{n}{m} \widetilde{X}^\top \widetilde{X}\) in the inverse:
where we know that \(X_{T^c}^\top X_{T^c} + \eta \frac{n}{m} \widetilde{X}^\top \widetilde{X}\) must be invertible since \(X_{T^c}^\top X_{T^c}\) is invertible.
Given these simplifications, it is easy to see that the difference between these two terms lies in the middle inverses. When m is large, we have \((X_{T^c}^\top X_{T^c})^{-1} \approx ((n-t)\varSigma )^{-1}\) and \(\left( X_{T^c}^\top X_{T^c} + \eta \frac{n}{m} \widetilde{X}^\top \widetilde{X}\right) ^{-1} \approx \left( \left( n-t+\eta n\right) \varSigma \right) ^{-1}\), where \(\varSigma\) is the covariance matrix for the common distribution of \(X_{T^c}\) and \(\widetilde{X}\). Therefore, the mutual incoherence parameter in the one-pool case is approximately equal to the mutual incoherence in the two-pool case scaled by \(\left( 1 + \eta \frac{n}{n-t}\right) ^{-1}\), which immediately implies that adding a second data pool improves the mutual incoherence condition. This is stated formally in the following proposition:
Proposition 11
(Comparison of mutual incoherence conditions) Let \(B_T = O(\sqrt{t})\). In the one-pool case, if \(n \ge t + \frac{c_1^2\sigma _x^4 (p+C_1)\Vert \varSigma \Vert ^2}{\lambda _{\min }^2(\varSigma )}\),then
with high probability.
In the two-pool case, if \(n \ge t + \max \Big \{\frac{c_1^2\sigma _x^4 \Vert \varSigma \Vert ^2}{\lambda _{\min }^2(\varSigma )},1\Big \}m\) and \(m \ge \max \{1,c_1^2 \sigma ^4_x (p+C_1')\Vert \varSigma \Vert _2^2\}\), then
with high probability.
Proposition 11 states that when m and n are sufficiently large, the one-pool mutual incoherence parameter is close to \(\frac{\left\| {X_{T^c}\varTheta X_T^\top }\right\| _\infty }{n-t}\) and the two-pool mutual incoherence parameter is close to \(\frac{\left\| {X_{T^c}\varTheta X_T^\top }\right\| _\infty }{n-t+\eta n}\). Since the second expression has a larger denominator, the mutual incoherence condition improves with the introduction of a second data pool with parameter \(\eta > 0\).
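The approximate scaling behind Proposition 11 can be illustrated numerically (hypothetical sizes, isotropic Gaussian design): with large m and n, the two-pool incoherence quantity is close to the one-pool quantity shrunk by the factor \(\frac{n-t}{n-t+\eta n}\).

```python
import numpy as np

rng = np.random.default_rng(6)
n, t, p, eta = 2000, 20, 5, 1.0
m = 2000
X_T = rng.normal(size=(t, p))
X_Tc = rng.normal(size=(n - t, p))
X_tl = rng.normal(size=(m, p))

def inc(mid):
    # || X_{T^c} mid^{-1} X_T^T ||_inf  (maximum absolute row sum)
    return np.max(np.abs(X_Tc @ np.linalg.inv(mid) @ X_T.T).sum(axis=1))

one = inc(X_Tc.T @ X_Tc)
two = inc(X_Tc.T @ X_Tc + eta * n / m * X_tl.T @ X_tl)
ratio_pred = (n - t) / (n - t + eta * n)   # predicted shrink factor from the text
assert two < one                           # the second pool helps for large m
assert abs(two / one - ratio_pred) < 0.1   # close to the predicted scaling
```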
For the gamma-min condition, we need to compare the terms G and \(G'\). Note that inequalities (49) and (50) are equivalent to lower-bounding the “signal-to-noise” ratio. The order of the lower bound in the two-pool case is the same as in the one-pool case, i.e., \(\frac{\min _{i}|\gamma ^*_i|}{\sigma } \ge O(\sqrt{t \log {n}})\). However, adding a second pool improves the constant, yielding a factor of \(\frac{1}{b_{\min }'}\) instead of \(\frac{1}{b_{\min }}\). As established in Proposition 10, we have \(b_{\min } \le b_{\min }'\); therefore, the lower bound in the two-pool case is smaller than in the one-pool case.
Note that the weight parameter \(\eta\) shows up in all three conditions. However, recall that the mutual incoherence condition is not always improved by adding a second pool, unless m is sufficiently large. Therefore, an appropriate conclusion is that once we have a large clean data pool, it is reasonable to place arbitrarily large weight on the second pool. On the other hand, if we have fewer clean data points, we cannot be as confident about the estimator obtained using the second pool alone. For example, in the orthogonal design, if we obtain clean points in the non-buggy subspace, the mutual incoherence condition is not improved no matter how large we make \(\eta\). In addition, the gamma-min condition involves the randomness from the noise, and in order to control the sparsity of \(\gamma\), we need the regularizer \(\lambda\) to match large \(\eta\) [cf. inequality (36)]. Based on inequality (50), we need the “signal-to-noise” ratio, i.e., \(\frac{n\lambda \sqrt{t}}{\sigma }\), to be sufficiently large. If \(\eta\) is too large, we cannot estimate relatively small components of \(\gamma ^*\). In summary, selecting \(\eta\) too large or too small is unwise: if \(\eta\) is too small, we do not improve the three conditions, whereas if \(\eta\) is too large, the range of controllable “signal-to-noise” ratios shrinks.
1.3.2 C.3.2 Proofs for sub-Gaussian design
In this section, we provide proofs for the sub-Gaussian design. Here is the proof of Proposition 8.
Proof of Proposition 8
We prove the results for the one- and two-pool cases sequentially. In each case, we begin with background calculations, and then analyze the eigenvalue condition followed by the mutual incoherence condition.
For the one-pool case, we know that \(\lambda\) satisfies inequality (34) with probability at least \(1 - e^{-\frac{C^2}{2}}\).
Note that \(x_j,\ j \in T^c\) are sub-Gaussian random vectors with parameter \(\sigma _x\). By Theorem 4.7.1 and Exercise 4.7.3 in Vershynin (2018) and our assumption of n, we have
with probability at least \(1 - e^{-C_1}\). We will later use this bound multiple times to establish the eigenvalue condition and the mutual incoherence condition.
We first consider the eigenvalue condition. By the dual Weyl’s inequality (Horn and Johnson 1994), we have \(\lambda _{\min }(A+B) \ge \lambda _{\min }(A) + \lambda _{\min }(B)\) for any square matrices A and B. Then
where the second inequality follows from the fact that \(\lambda _{\min }(A) \le \lambda _{\max }(A)\) for any square matrix A. Combining this with inequality (53) and taking \(n \ge t + 4\frac{c_1^2 \sigma _x^4 (p + C_1)\Vert \varSigma \Vert _2^2}{\lambda ^2_{\min }(\varSigma )}\) by assumption (47), we have that
with probability \(1 - e^{-C_1}\). We now derive the following result:
Lemma 3
Suppose \(X_{T^c}^\top X_{T^c}\) is invertible, where \(X_T \in \mathbb {R}^{t \times p}\) and \(X_{T^c} \in \mathbb {R}^{(n-t) \times p}\). Then
implying that the eigenvalue condition for the one-pool case holds.
Proof
Define \(C = Q(I + Q^\top Q)^{-1}Q^\top\), where \(Q \in \mathbb {R}^{t \times p}\), and suppose \({\text {rank}}(Q) = r\). Let \(Q = U S V^\top\) be the SVD, where \(U\in \mathbb {R}^{t \times t}, V \in \mathbb {R}^{p \times p}\), and \(S = \begin{bmatrix} J_{r \times r} & 0_{r \times (p-r)}\\ 0_{(t-r) \times r} & 0_{(t-r) \times (p-r)}\end{bmatrix}\). Here, J is a diagonal matrix of positive singular values. Then
Therefore, \(\lambda _{\max }(C) = \frac{a^2_{\max }}{1 + a^2_{\max }}\), where \(a_{\max }\) is the maximum singular value appearing in J. Also note that \(a_{\max }^2\) is the maximum eigenvalue of \(Q^\top Q\).
Following (16.51) in Seber (2008), given that \(X_{T^c}^\top X_{T^c}\) is invertible, there exists a non-singular matrix A such that \(AX_{T^c}^\top X_{T^c}A^\top = I\) and \(AX_T^\top X_TA^\top = D\), where D is a diagonal matrix.
Note that
where \(Q:=X_TA^\top\).
Based on our earlier arguments, we know that the matrix under consideration has maximum eigenvalue \(\frac{\lambda _{\max }(AX_T^\top X_TA^\top )}{1 + \lambda _{\max }(AX_T^\top X_TA^\top )}\). Since \(AX_T^\top X_TA^\top\) is similar to \(X_T^\top X_TA^\top A\), we have \(\lambda _{\max }(AX_T^\top X_TA^\top ) = \lambda _{\max }(X_T^\top X_TA^\top A)\). Furthermore, we have \(A^\top A = (X_{T^c}^\top X_{T^c})^{-1}\), implying that
Altogether, we have
Finally, we may conclude that
Since \(\lambda _{\min }(X_{T^c}^\top X_{T^c}) > 0\), we have \(\lambda _{\min }\left( P_{X, TT}^\perp \right) < 1\), implying the desired result. \(\square\)
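The spectral identity at the heart of this proof can be verified numerically (hypothetical dimensions): the eigenvalues of \(Q(I + Q^\top Q)^{-1}Q^\top\) are exactly \(a_i^2/(1+a_i^2)\), where the \(a_i\) are the singular values of Q.

```python
import numpy as np

rng = np.random.default_rng(7)
t, p = 4, 6
Q = rng.normal(size=(t, p))
C = Q @ np.linalg.inv(np.eye(p) + Q.T @ Q) @ Q.T
a2 = np.linalg.eigvalsh(Q @ Q.T)               # squared singular values of Q
# eigenvalues of C are the monotone transform a^2 / (1 + a^2):
assert np.allclose(np.sort(np.linalg.eigvalsh(C)), np.sort(a2 / (1 + a2)))
```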
We now consider the mutual incoherence condition. By the triangle inequality, we have
We bound \(\textcircled {1}\) and \(\textcircled {2}\) separately. Note that
In order to bound \(\textcircled {1}\), we bound three parts separately. By assumption, we have \(\left\| {X_T^\top }\right\| _2 \le B_T\). For \(\max _{j \in T^c} \Vert x_j\Vert _2\), we leverage the Hanson-Wright inequality (Theorem 6.2.1 in Vershynin (2018)) and a union bound. By the Hanson-Wright inequality, we see that for \(\varDelta > 0\),
where c is an absolute constant.
By a union bound, we then have
Setting \(\varDelta = c_2 \sigma _x^2\max \{\sqrt{p\log n}, \log n\}\) with \(c_2 \ge 1\) so that we have \(\min \left\{ \frac{\varDelta ^2}{\sigma _x^4 p}, \frac{\varDelta }{\sigma _x^2}\right\} \ge c_2 \log n\), we conclude that
with probability at least \(1 - n^{-(c_2-1)}\), where \(c_2 \ge \max \{2,2/c\}\).
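A rough Monte Carlo illustration of this max-norm bound (Gaussian rows, hypothetical sizes and constants; the factor 3 below is a generous stand-in for the constant \(c_2\)): the largest row norm stays within the \(\sigma _x(\sqrt{p} + O(\sqrt{\log n}))\) envelope.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sx = 500, 10, 1.0
X = rng.normal(size=(n, p)) * sx
max_norm = np.max(np.linalg.norm(X, axis=1))
# sanity: the max row norm sits between sqrt(p) and the union-bound envelope
assert max_norm >= sx * np.sqrt(p)
assert max_norm <= sx * (np.sqrt(p) + 3 * np.sqrt(np.log(n)))
```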
To bound \(\left\| { \varTheta - \left( \frac{X_{T^c}^\top X_{T^c}}{n-t}\right) ^{-1} }\right\| _2\), note that for two matrices A and B, we have
Combining this fact with inequalities (53) and (54), we obtain
Altogether, we obtain the bound
We now consider \(\textcircled {2}\). Note that
Therefore,
Finally, assuming n satisfies the bound (47), and taking a union bound over all the probabilistic statements appearing above, we conclude that the mutual incoherence condition holds with probability at least \(1 - e^{-\frac{C^2}{2}} - 2e^{-C_1} - n^{-(c_2-1)}\). This concludes the proof for the one-pool case.
For the two-pool case, we will use the following inequalities:
with probability at least \(1 - 2e^{-C_1'}\). Combining these inequalities and using the triangle inequality, we obtain
with probability at least \(1 - 2e^{-C_1'}\).
Analogous to Lemma 3, we can conclude that if \(X_{T^c}^\top X_{T^c} + \frac{\eta n}{m}\widetilde{X}^\top \widetilde{X}\) is invertible, the eigenvalue condition satisfies
(This can be proved simply by replacing \(X_{T^c}^\top X_{T^c}\) with \(X_{T^c}^\top X_{T^c} + \frac{\eta n}{m} \widetilde{X}^\top \widetilde{X}\) in the proof of Lemma 3.) However, since we further wish to bound the minimum eigenvalue from below by \(\lambda _{\min }(\varSigma )/2\), both to match the one-pool case and for later use in the proof of the mutual incoherence condition, we will consider \(X_{T^c}^\top X_{T^c} + \frac{\eta n}{m} \widetilde{X}^\top \widetilde{X}\) directly.
Note that
Thus, if we choose \(m \ge 4c_1^2 \sigma ^4_x (p+C_1')\Vert \varSigma \Vert _2^2\), we have
with probability at least \(1 - 2e^{-C_1'}\).
We now consider the mutual incoherence condition. Similar to the derivation of inequality (58), we have that
Combining this with inequality (57), we obtain
Therefore, together with the triangle inequality and inequality (60), we can bound the mutual incoherence parameter as follows:
By the assumption on n in inequality (48), the mutual incoherence condition therefore holds with probability \(1 - e^{-\frac{C'^2}{2}} - 2e^{-C_1'} - n^{-(c_2-1)}\). \(\square\)
Here is the proof of Proposition 9.
Proof of Proposition 9
To achieve exact support recovery, we need all three conditions to hold. The eigenvalue condition and the mutual incoherence condition have already been discussed in the analysis of subset support recovery in the proof of Proposition 8, so it remains to analyze the gamma-min condition.
Recall that
To simplify notation, we define
We also define the random variables
Since \(P_{X'}^\perp\) is a projection matrix and the maximum singular value of \(P_{X', T\cdot }^\perp\) is at most that of \(P_{X'}^\perp\), we have
for all \(i \in T\). Note that \(Z_i\) is a zero-mean sub-Gaussian random variable with parameter at most \(\frac{\sigma }{b'_{\min }}\). By a sub-Gaussian tail bound, we then have
Therefore, with probability at least \(1 - 2e^{-c}\), we have \(A \le \frac{2\sigma \sqrt{\log t+c}}{b'_{\min }}.\) Note that \(\Vert (P_{X', TT}^\perp )^{-1}\Vert _\infty \le \sqrt{t} \Vert (P_{X', TT}^\perp )^{-1}\Vert _2 = \frac{\sqrt{t}}{b_{\min }'}.\) We can then immediately obtain the bound \(B \le \frac{2n\lambda \sqrt{t}}{b_{\min }'}\).
Combined with the fact that \(\lambda \ge \frac{2\sigma }{n(1-\alpha ')} \max \left\{ 1, \sqrt{\frac{\eta n}{mL}}\right\} \left( \sqrt{\log 2(n-t)}+C'\right)\), we then obtain
Thus, as long as \(\min _{i \in T} |\gamma ^*_i|\) is greater than or equal to the RHS of the inequality above, the gamma-min condition holds with probability at least \(1 - 2e^{-c} - e^{-\frac{C'^2}{2}}\). Consequently, exact support recovery is achieved.
The proof for the one-pool case is similar to the proof for the two-pool case provided above, so we omit the details here. \(\square\)
Here is the proof of Proposition 10.
Proof of Proposition 10
By the Sherman–Morrison–Woodbury formula (Henderson and Searle 1981), we have
We now state and prove two useful lemmas:
Lemma 4
Assume \(X^\top X\) is invertible. Define
Then \(\lambda _{\min }\left( A\right) \ge 0\), with equality when \(\widetilde{X}\left( X^\top X \right) ^{-1}X_T^\top\) does not have full column rank.
Proof
First note that since \(X^\top X\) is invertible and \(\widetilde{X}(X^\top X)^{-1}\widetilde{X}^\top \succ 0\), the matrix \(I+\frac{\eta n}{m} \widetilde{X}(X^\top X)^{-1}\widetilde{X}^\top\) is invertible. Note that
so the minimum eigenvalue of A is nonnegative.
In order to study when \(\lambda _{\min }(A) = 0\), let \(z = \widetilde{X}\left( X^\top X \right) ^{-1}X_T^\top y\). When \(\widetilde{X}\left( X^\top X \right) ^{-1}X_T^\top\) is full-rank, we have \(z \ne 0\) for every \(y \ne 0\), so \(\lambda _{\min }(A) > 0\). When \(\widetilde{X}\left( X^\top X \right) ^{-1}X_T^\top\) is not full-rank, there exists \(y \ne 0\) such that \(z = 0\), which gives \(y^\top A y = 0\) and hence \(\lambda _{\min }(A) = 0\). \(\square\)
Lemma 5
The following equation holds:
Proof
Since \(X_T\left( X^\top X\right) ^{-1}X_T^\top\) is symmetric positive semidefinite, we can write \(X_T\left( X^\top X\right) ^{-1}X_T^\top = Q \varLambda Q^\top\), where Q is an orthogonal matrix and \(\varLambda\) is a diagonal matrix with nonnegative diagonals. Then
Furthermore, we have shown in inequality (56) that
Hence, the maximum diagonal entry of \(\varLambda\) is upper-bounded by 1, and \(I -\varLambda\) has all diagonal entries in the range [0, 1]. Thus, we have shown that \(\min {{\text {diag}}(I- \varLambda )} = 1 - \max ({\text {diag}}(\varLambda ))\), implying the conclusion of the lemma. \(\square\)
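Under this reading, the identity asserted by Lemma 5 appears to be \(\lambda _{\min }(I - X_T(X^\top X)^{-1}X_T^\top ) = 1 - \lambda _{\max }(X_T(X^\top X)^{-1}X_T^\top )\); a numerical sanity check of that relation with arbitrary small dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, t = 20, 5, 4
X = rng.standard_normal((n, p))
XT = X[:t]                                   # rows indexed by T
H = XT @ np.linalg.inv(X.T @ X) @ XT.T       # symmetric PSD with eigenvalues in [0, 1]

lam = np.linalg.eigvalsh(H)
# lambda_min(I - H) = 1 - lambda_max(H), since eig(I - H) = 1 - eig(H)
assert abs(np.min(np.linalg.eigvalsh(np.eye(t) - H)) - (1 - np.max(lam))) < 1e-10
```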
Returning to the proof of the proposition, we have
Here, (i) comes from the fact that
which follows from Lemma 4. Furthermore, by Lemma 5, we have
and
Altogether, we conclude that the minimum eigenvalue is at least improved by \(\frac{\eta n}{m}\lambda _{\min }\left( X_T\left( X^\top X \right) ^{-1} \widetilde{X}^\top (I+\frac{\eta n}{m} \widetilde{X}(X^\top X)^{-1}\widetilde{X}^\top )^{-1}\widetilde{X}\left( X^\top X \right) ^{-1}X_T^\top \right)\). \(\square\)
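As a numerical sanity check, the Sherman–Morrison–Woodbury decomposition underlying this eigenvalue improvement can be verified directly (small arbitrary dimensions; \(k\) stands in for \(\frac{\eta n}{m}\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, t = 30, 8, 5, 4
k = 0.7                                       # plays the role of eta * n / m
X = rng.standard_normal((n, p))
Xtil = rng.standard_normal((m, p))            # trusted data \tilde{X}
XT = X[:t]

G = np.linalg.inv(X.T @ X)
lhs = XT @ G @ XT.T - XT @ np.linalg.inv(X.T @ X + k * Xtil.T @ Xtil) @ XT.T
inner = np.linalg.inv(np.eye(m) + k * Xtil @ G @ Xtil.T)
rhs = k * XT @ G @ Xtil.T @ inner @ Xtil @ G @ XT.T

# Sherman-Morrison-Woodbury: the two expressions coincide, and rhs is PSD,
# so adding trusted data can only increase the relevant minimum eigenvalue
assert np.allclose(lhs, rhs)
assert np.min(np.linalg.eigvalsh(rhs)) >= -1e-10
```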
Here is the proof of Proposition 11.
Proof of Proposition 11
The proof leverages arguments from the proof of Proposition 8. The goal is to argue that when n and m are sufficiently large, the empirical quantities are close to their population-level versions. We will use Big-O notation to simplify our discussion.
As already stated in inequality (59), if \(n \ge t + \frac{c_1^2\sigma _x^4 \Vert \varSigma \Vert ^2}{\lambda _{\min }^2(\varSigma )}(p+C_1)\), then
with probability at least \(1 - e^{-C_1} - n^{-1}\), where \(c_2 > \max \{2,2/c\}\).
Also for the two-pool case, if \(n \ge t + \max \Big \{\frac{c_1^2\sigma _x^4 \Vert \varSigma \Vert ^2}{\lambda _{\min }^2(\varSigma )},1\Big \}m\) and \(m \ge \max \{1,c_1^2 \sigma ^4_x (p+C_1')\Vert \varSigma \Vert _2^2\}\), we have
with probability at least \(1 - 2e^{-C_1'} - n^{-1}\), where \(c_2\) is defined in the same way as above. Noting that \(B_T \propto \sqrt{t}\) and using the triangle inequality, we conclude the proof. \(\square\)
D Proofs for Sect. 4
In this section, we provide proofs and additional details for the results in Sect. 4. We will establish several auxiliary results in the process, which are stated and proved in “Appendix D.4”. The flow of logic is outlined below:
Theorem 3 \(\Leftarrow\) (Lemma 6, Lemma 12);
Lemma 6 \(\Leftarrow\) Theorem 2;
Lemma 12 \(\Leftarrow\) (Lemma 7, Lemma 11);
Lemma 11 \(\Leftarrow\) (Lemma 8, Lemma 9);
Lemma 9 \(\Leftarrow\) Lemma 7;
Corollary 1 \(\Leftarrow\) (Theorem 3, Corollary 2).
We sometimes write \(\widehat{\gamma }(\lambda )\) to represent the estimator from Lasso-based debugging with tuning parameter \(\lambda\).
1.1 D.1 Proof of Theorem 3
We will first argue that the algorithm will stop, and then argue that all bugs are identified correctly when the algorithm stops. Finally, we will take a union bound over all the iterations in the while loop to obtain a probabilistic conclusion.
Algorithm 1 stops: Note that if we have an iteration k such that \(\widehat{\lambda }^k > 2\lambda ^*\) and \(C = 0\), then the algorithm must stop after at most \(\lfloor \log _2 \frac{\lambda ^u}{\lambda ^*} \rfloor\) iterations. Otherwise, we know that \(C = 1\) for all iterations k such that \(\widehat{\lambda }^k \ge \lambda ^*\). Thus, after \(k = \lfloor \log _2\frac{\lambda ^u}{\lambda ^*}\rfloor\) iterations, we have
As established in Lemma 6, we know that all true bugs will be identified with such a value of \(\lambda ^k\), so the remaining points are \((X^{(k)}, y^{(k)}) = (X_{T^c}, y_{T^c})\). Also note that
Hence, by Lemma 12, we have
Therefore, the stopping criterion takes effect and the algorithm stops.
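The halving argument can be sketched in code: starting from \(\lambda ^u\) and halving, the loop reaches the interval \((\lambda ^*, 2\lambda ^*]\) after at most \(\lfloor \log _2 \frac{\lambda ^u}{\lambda ^*} \rfloor\) iterations. A minimal sketch with a hypothetical stopping oracle `stops` standing in for the check C:

```python
import math

def halving_iterations(lam_u, lam_star, stops):
    """Halve the tuning parameter from lam_u until the stopping check fires."""
    lam, k = lam_u, 0
    while not stops(lam):
        lam /= 2.0
        k += 1
    return k, lam

# Hypothetical oracle: the criterion fires once lam enters (lam_star, 2*lam_star]
lam_u, lam_star = 16.0, 1.0
k, lam = halving_iterations(lam_u, lam_star, lambda l: l <= 2 * lam_star)
assert k <= math.floor(math.log2(lam_u / lam_star))   # at most floor(log2(lam_u/lam*)) halvings
assert lam_star < lam <= 2 * lam_star
```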
Algorithm 1 correctly identifies all bugs: A byproduct of the preceding argument is that \(\widehat{\lambda }> \lambda ^*\). By Theorem 1, we have \({\text {supp}}(\widehat{\gamma }^k) \subseteq {\text {supp}}(\gamma ^*)\). Now suppose we are at a stage where l of the t bugs are flagged, where \(l \in \{0,1,\dots ,t\}\).
If \(l = t\), then \({\bar{X}} = X_{T^c}\). As argued previously, the algorithm stops with high probability. Hence, we output all of the bugs.
Otherwise, we have \(l \le t-1\). Suppose this happens at the \(k\)th iteration. Then at least one bug remains in \((X^{(k)}, y^{(k)})\), and all the clean points are included. Let S denote the corresponding row indices of X and let \(\gamma ^*_S\) denote the corresponding subvector of \(\gamma ^*\). Since bugs still remain, we must have \(\min _{i\in S \cap T} |\gamma ^*_i| \ge \min _{i\in T} |\gamma ^*_i|\). Furthermore,
By Lemma 12, we have
implying that \(C = 0\). Thus, the procedure proceeds to the \((k+1)^{\text {st}}\) iteration. If for all k such that \(\widehat{\lambda }^k \ge 2\lambda ^*\), bugs still remain, then \(\widehat{\lambda }^k\) keeps shrinking until the \(\lfloor \log _2\frac{\lambda ^u}{\lambda ^*}\rfloor ^{\text {th}}\) iteration. Then the tuning parameter must lie in the interval \((\lambda ^*,2\lambda ^*]\), resulting in a value of \(\widehat{\gamma }\) such that \({\text {supp}}(\widehat{\gamma }) = {\text {supp}}(\gamma ^*)\).
Probability by union bound: Now we study the probability for this algorithm to output a value of \(\widehat{\gamma }\) that achieves exact recovery. Firstly, the algorithm stops as long as Lemmas 6 and 12 hold, which happens with probability at least \(1 - \frac{3}{n-t} - 2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right)\).
Secondly, consider the argument that the algorithm correctly identifies all bugs. For each iteration, the events \(\{C = 0 \text { if a bug still exists}\}\) and \(\{C=1 \text { if no bugs exist}\}\) hold as long as Lemmas 6 and 12 hold, which happens with probability at least \(1 - \frac{3}{n-t} - 2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right)\). If the algorithm has K iterations, the probability that the algorithm flags all bugs is therefore at least \(1 - \frac{3K}{n-t} - 2K\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right)\) by a union bound. Since we have argued that \(K \le \log _2 \frac{\lambda ^u}{\lambda (\sigma ^*)}\), the desired statement follows.
1.2 D.2 Proof of Corollary 1
According to the PDW procedure, we can set \(\widehat{\gamma }= \mathbf {0}\), solve for \(\widehat{z}\) via the zero-subgradient equation, and check whether \(\Vert \widehat{z}\Vert _\infty <1\), where \(\widehat{z}\) is a subgradient of \(\Vert \widehat{\gamma }\Vert _1\). Setting the gradient of the loss function equal to zero implies that
Therefore, we see that \(\Vert \widehat{z}\Vert _\infty < 1\) for \(\lambda > \frac{\Vert {\bar{P}}^\top P_{X'}^\perp y'\Vert _\infty }{n}\), which means the optimizer satisfies \(\widehat{\gamma }= \mathbf {0}\). Since \(\lambda _u = \frac{2\Vert {\bar{P}}^\top P_{X'}^\perp y'\Vert _\infty }{n}\), the output with tuning parameter \(\lambda _u\) gives \(\widehat{\gamma }(\lambda _u) = 0\).
Note that
by the triangle inequality. The second term is bounded by \(2\max \{1,\sqrt{\frac{\eta n}{mL}}\}\sqrt{\log 2n}\, \sigma ^*\) with probability at least \(1- \frac{1}{n}\), since \(e_j^\top P_{X'}^\perp \epsilon '\) is sub-Gaussian with variance parameter at most \(\max \{1,\frac{\eta n}{mL}\}\sigma ^{*2}\). For the first term, we have
where (i) holds because \(\Vert v^\top \gamma ^*\Vert _1 = \sum _{i\in T} |v_i \gamma ^*_i| \le t \Vert v\Vert _\infty \Vert \gamma ^*\Vert _\infty\) for any row v of the matrix \(\bar{P}^\top \bar{P}\), and (ii) holds because \(\bar{P}^\top \bar{P}\) is a submatrix of the projection matrix \(P_{X'}^\perp\) and each entry of a projection matrix is upper-bounded by 1. Altogether, we obtain
By a similar argument as in Theorem 3 and Corollary 2, we know that Algorithm 1 stops after at most \(\log _2 \frac{\lambda _u }{\lambda (\sigma ^*)}\) iterations with probability at least \(1 - \frac{1}{n-t}\). Hence,
where (1) comes from the fact that \(\bar{P}_{T^c}^\perp\) is a submatrix of \(P_{X'}^\perp\), which has spectral norm 1 when \(n \ge t+p+1\); and (2) holds because \(1-\alpha '<1\). To illustrate that \(\Vert \bar{P}_{T^c}^\perp \Vert _2 = 1\), it is sufficient to show that \(\Vert P_{X',T^c T^c}^\perp \Vert _2 =1\), since \(P_{X',T^c T^c}^\perp\) is a principal submatrix of \(P_{X'}^\perp\). By the eigenvalue interlacing theorem (Hwang 2004), we know that \(\lambda _{\max }(P_{X',T^c T^c}^\perp )\) is no less than the \((t+1)^{\text {st}}\) largest eigenvalue of \(P_{X'}^\perp\), which is a projection matrix and therefore has \(n-p\) eigenvalues equal to 1. Thus, if \(t+1 \le n-p\), i.e., \(n\ge t + p + 1\), then \(\Vert P_{X',T^c T^c}^\perp \Vert _2 = 1\).
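The interlacing argument can be checked numerically: for \(n \ge t + p + 1\), the \(T^c\)-principal submatrix of the projection \(P_{X'}^\perp\) still has spectral norm 1 (a sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, t = 25, 4, 6                            # satisfies n >= t + p + 1
X = rng.standard_normal((n, p))
P_perp = np.eye(n) - X @ np.linalg.pinv(X)    # projection onto the complement of col(X)

sub = P_perp[t:, t:]                          # principal submatrix indexed by T^c
# Interlacing: lambda_max(sub) >= (t+1)-st largest eigenvalue of P_perp = 1,
# and a principal submatrix of a projection has spectral norm at most 1
assert abs(np.linalg.norm(sub, 2) - 1.0) < 1e-8
```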
Now that we have bounded the number of iterations, we consider the probability that the statement holds. Note that \(\epsilon '\) is sub-Gaussian and all the statements based on \(\lambda (\sigma ^*)\) hold with probability \(1-\frac{1}{n-t}\). Compared to Theorem 3, note that on each iteration, we have subset support recovery with probability \(1-\frac{1}{n-t}\); and on iteration \(\log _2 \frac{\lambda _u}{\lambda (\sigma ^*)}\), we have exact support recovery with probability \(1-\frac{1}{n-t}\). Thus, we conclude that Algorithm 1 outputs a value of \(\widehat{\lambda }\) that achieves exact recovery with probability at least
1.3 D.3 Proof of Proposition 2
We consider the three cases in Appendices D.3.1, D.3.2, and D.3.3.
Let \(\varSigma = {\mathbb {E}}[x_i x_i^\top ]\) and \(\varTheta = \varSigma ^{-1}\), and assume that \(X^{(k)}\) corresponds to some \(X_S\) with rows indexed by S. Our goal is to prove that
for at most \(\log _2 \frac{\lambda _u}{\lambda ^*}\) such sets S. Note that \(T^c \subseteq S \subseteq [n]\) holds with probability at least \(1 - \frac{\log _2 \frac{\lambda _u}{\lambda ^*}}{n-t}\).
1.3.1 D.3.1 Proof of Proposition 2 for Gaussian case
The spectral norm bound follows from standard results (Vershynin 2010) and holds for a fixed set S with probability at least \(1-e^{-|S|} \ge 1 - e^{-(n-t)}\). Note that Algorithm 1 runs for at most \(\log _2 \frac{\lambda _u}{\lambda ^*}\) iterations by Theorem 3. Taking a union bound over all sets S, we obtain an overall probability of \(1 - \log _2 \frac{\lambda _u}{\lambda ^*} \, e^{-(1-c_t)n} \ge 1 - e^{-\frac{n}{2} + \log \log _2 \frac{\lambda _u}{\lambda ^*}}\).
We now consider (63). Define \(z_i = \varTheta ^{1/2}x_i\) for \(1 \le i \le n\), so that
We know the \(\varTheta ^{1/2}x_i\)’s are i.i.d. isotropic Gaussian random vectors. Hence, \(z_i^\top z_i \sim \chi ^2(p)\) satisfies
with probability at least \(1 - \delta\). Similarly, we can bound \(z_k^\top z_k\) and \((z_i+z_k)^\top (z_i + z_k)\). Since \(z_i^\top z_k = \frac{1}{2}[(z_i+z_k)^\top (z_i + z_k) - z_i^\top z_i - z_k^\top z_k]\), we then have
with probability at least \(1 - \delta\).
We now choose \(\delta = \frac{1}{n^{c}}\) for some \(c > 2\) and take a union bound over all \(n^2\) entries of the matrix \(X\varTheta X^\top\), to obtain
with probability at least \(1 - \frac{1}{n^{c'-2}}\), where \(c' > 2\).
Finally, note that for all \(S \subseteq [n]\), we have
1.3.2 D.3.2 Proof of Proposition 2 for sub-Gaussian case
By Lemma 14, inequality (64) holds for a fixed set S, with probability at least \(1-e^{-c|S|} \ge 1 - e^{-c(n-t)}\) for some \(c > 0\). Note that Algorithm 1 runs for at most \(\log _2 \frac{\lambda _u}{\lambda ^*}\) iterations. We then take a union bound over the possible subsets \(T^c \subseteq S \subseteq [n]\) to reach a probability of at least \(1 - \log _2 \frac{\lambda _u}{\lambda ^*} \, e^{-c(1-c_t)n} \ge 1 - e^{-\frac{cn}{2} + \log \log _2 \frac{\lambda _u}{\lambda ^*}}\).
Next, we focus on verifying inequality (63). Assuming that the \(x_i\)’s are independent random vectors and the components of the \(x_i\)’s are independent of each other, our goal is to prove that
with high probability, where \(\varSigma = {\text {Cov}}(x_i) = \varTheta ^{-1}=: D^2\) is a diagonal matrix.
Define \(z_i = D^{-1}x_i\). Since the \(z_i\)’s are mutually independent with independent components, we know that the vector \(g_{ij} = (z_{i1}, ..., z_{ip}, z_{j1},...,z_{jp})^\top\), for \(i \ne j\), also has independent components. Furthermore, the sub-Gaussian parameter of \(g_{ij}\) is bounded by \(l_{\max }= \max _{1 \le q \le p} \frac{K}{d_q^2}\), where K is the sub-Gaussian variance parameter of the \(x_i\)’s. This is because for a unit vector u, we have
Since \(\lambda _{\min }(\varSigma )\) is bounded away from zero, the \(d_q\)’s are bounded away from zero for each q, so \(l_{\max }\) is bounded, as well.
Now let \(A = \begin{bmatrix} 0_{p \times p} & \quad I_{p \times p} \\ 0_{p \times p} & \quad 0_{p \times p} \end{bmatrix}\). By the Hanson-Wright inequality, with probability at least \(1 -\delta\), we have
where \(c_1\) is a constant related to \(l_{\max }\).
Now applying the Hanson-Wright inequality to the vector \(z_i\), we have
with probability at least \(1- \delta\). Noting that \({\mathbb {E}}[\left\Vert z_i\right\Vert _2^2] = {\text {tr}}(\varTheta \varSigma ) = p\), we finally have
Plugging in \(\delta = \frac{2}{n^3}\) and taking a union bound, we then conclude that
with probability at least \(1 - \frac{2}{n}\).
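The centering value in this argument comes from \({\mathbb {E}}[\Vert z_i\Vert _2^2] = {\text {tr}}(\varTheta \varSigma ) = p\); a quick check of the trace identity for an arbitrary covariance:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # arbitrary SPD covariance matrix
Theta = np.linalg.inv(Sigma)

# tr(Theta @ Sigma) = tr(I_p) = p, so E||Theta^{1/2} x||^2 = p for x ~ (0, Sigma)
assert abs(np.trace(Theta @ Sigma) - p) < 1e-8
```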
1.3.3 D.3.3 Proof of Proposition 2 for convex concentration case
Recall the following definition:
Definition 2
(Convex concentration property) Let X be a random vector in \({\mathbb {R}}^d\). If for every 1-Lipschitz convex function \(\varphi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) such that \({\mathbb {E}}[\varphi (X)]<\infty\) and for every \(t>0\), we have
then X satisfies the convex concentration property with constant K.
Suppose \(x_i\) has the convex concentration property with parameter K. Note that
By Lemma 13, we thus have the exponential tail bound
for all \(1 \le i \le p\), which implies that
with probability at least \(1- \delta\). Taking \(\delta = 2/n^3\), we then obtain
with probability at least \(1 - \frac{2}{n^3}\).
Now we consider the off-diagonal entries \(\frac{x_i^\top \varTheta x_j}{p}\), for \(i \ne j\). We first rewrite
Conditioning on the event \(\{\left\Vert \varTheta x_j \right\Vert _2 \le w\}\) for some \(w > 0\), we obtain
Since we have a convex 1-Lipschitz function mapping from \(x_i\) to \(x_i^\top \frac{ \varTheta x_j }{\left\Vert \varTheta x_j \right\Vert _2}\), we can further upper-bound the probability using the convex concentration property:
where (1) and (3) use the convex concentration property and (2) uses Jensen’s inequality. The last inequality assumes that \(w \ge \sqrt{p \Vert \varSigma \Vert _2}\), which can be guaranteed if we choose w sufficiently large.
Plugging \(\varDelta = c\max \left\{ \frac{\log n}{p}, \sqrt{\frac{\log n}{p}}\right\}\) and \(w = c' \left( \sqrt{p} + \sqrt{\log n}\right)\) into the above derivations, we then obtain
If \(p > \log n\), then \(2 \exp \left( -\frac{\max \{(\log n)^2, p\log n \}}{(p + \log n) K^2}\right) \le 2 \exp \left( -\frac{ c''''\log n }{ K^2}\right)\); If \(p \le \log n\), then \(2 \exp \left( -\frac{\max \{(\log n)^2, p\log n \}}{(p+ \log n) K^2}\right) \le 2 \exp \left( -\frac{ c'''''\log n }{ K^2}\right)\). Hence, we have
We can choose c and \(c'\) sufficiently large to ensure that \(C > 2\). Combining this with inequality (67) using a union bound, we finally obtain the desired result.
1.4 D.4 Auxiliary lemmas
By Theorem 1, we have the following corollary:
Corollary 2
For two data pools, suppose the eigenvalue and mutual incoherence conditions hold. Let \(\lambda \ge \lambda (\sigma ^*).\) Then with probability at least \(1- \frac{1}{n-t}\), we have \({\text {supp}}(\widehat{\gamma }) \subseteq {\text {supp}}(\gamma ^*)\), and
Proof
Recall that the rule for regularizer selection in Theorem 1 is
Note that \(e_j^\top \bar{P}_{T^c}^\top \left( I-\bar{P}_T (\bar{P}_T^\top \bar{P}_T)^{-1}\bar{P}_T^\top \right) \frac{\epsilon '}{n}\) is sub-Gaussian with variance parameter \(\max \{1,\frac{\eta n}{mL}\}\frac{\Vert \bar{P}_{T^c}^\perp \Vert _2^2\sigma ^{*2}}{n^2}\). We have
with probability at least \(1 - \frac{1}{n-t}\). According to the definition of \(\lambda (\sigma ^*)\), we can further derive the bound for \(\widehat{\gamma }\), since
\(\square\)
The following lemma suggests that if \(\min _{i \in T} |\gamma ^*_i| \ge G'(2\lambda ^*)\), then \({\text {supp}}(\widehat{\gamma }(\lambda )) = {\text {supp}}(\gamma ^*)\) if we take \(\lambda \in [\lambda ^*, 2\lambda ^*]\).
Lemma 6
If \(\min _{i \in T} |\gamma ^*_i| \ge G'(2\lambda ^*)\), then taking \(\lambda \in [\lambda ^*,2\lambda ^*]\) yields an estimator \(\widehat{\gamma }(\lambda )\) that satisfies \({\text {supp}}(\widehat{\gamma }(\lambda )) = {\text {supp}}(\gamma ^*)\).
Proof
According to Theorem 1, for a regularizer \(\lambda \in [\lambda ^*, 2\lambda ^*]\), we have \(\widehat{\gamma }_{T^c} = 0\) and \(\left\Vert \widehat{\gamma }(\lambda ) - \gamma ^*\right\Vert _{\infty } \le G'(\lambda )\). If \(\min _{i\in T} |\gamma ^*_i| \ge G'(2\lambda ^*)\), then by the triangle inequality, we have
for all \(i \in T\). \(\square\)
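The omitted display presumably chains the triangle inequality as follows (a sketch, using that \(G'(\lambda ) \le G'(2\lambda ^*)\) for \(\lambda \le 2\lambda ^*\)):

```latex
|\widehat{\gamma}_i(\lambda)|
  \;\ge\; |\gamma^*_i| - |\widehat{\gamma}_i(\lambda) - \gamma^*_i|
  \;\ge\; \min_{i \in T}|\gamma^*_i| - G'(\lambda)
  \;\ge\; G'(2\lambda^*) - G'(\lambda)
  \;\ge\; 0,
```

so each index \(i \in T\) remains in the support of \(\widehat{\gamma }(\lambda )\).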
We use \(X_S\) to represent some \(X^{(k)}\) for \(S \subseteq [n]\), as shown in Algorithm 3. In each loop of the algorithm, we know that the points in \(S^c\) all lie in T by the subset recovery result. Thus, \(S \supseteq T^c\). Let \(l = n -|S|\), and note that \(0 \le l \le t\).
Lemma 7
Suppose Assumption 4 holds. If \(\lambda _{\min }(\varSigma )\) and \(\lambda _{\max }(\varSigma )\) are bounded, then
Proof
Using the notation \(\varTheta = \varSigma ^{-1}\) and \(\widehat{\varSigma }= \frac{X_S^\top X_S}{|S|}\), we have
By assumption, we may bound the second term by
For the first term, we have
We now have the bound
as well, where the second inequality holds by Weyl’s theorem (Horn and Johnson 1994): \(\lambda _{\min }(\widehat{\varSigma }) \ge \lambda _{\min }(\varSigma ) - \Vert \varSigma -\widehat{\varSigma }\Vert _2\). The basic idea for the first inequality is to use the submultiplicativity of matrix norms to conclude that
Hence, an upper bound on \(\left\Vert A-B\right\Vert _2\)—which we obtain from our assumptions—together with minimum eigenvalue bounds on A and B, implies an upper bound on \(\left\Vert A^{-1} - B^{-1}\right\Vert _2\).
Finally, we have
By assumption, we have
Hence, rescaling and using the triangle inequality, we have
Altogether, we have the bound
Finally, we have
This finishes the proof. \(\square\)
We use \(\alpha _{(k)}\) to denote the kth order statistic of \(|\epsilon _i|\), for \(i \in T^c\), where \(\alpha _{(1)} \le \alpha _{(2)} \le \cdots \le \alpha _{(n-t)}\).
Lemma 8
For i.i.d. random variables \(\{|\epsilon _i|\}_{i \in T^c}\), the kth order statistic, for any \(k \in \{\frac{n-t}{2}, \dots , \frac{n}{2} \}\), satisfies
with probability at least \(1-2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right)\), for \(\nu \in (0,\frac{1}{2})\) such that \(\nu < \frac{1}{2} - c_t\).
Proof
By the assumptions on the noise distribution, we have
Let \(\xi _i\)’s be i.i.d. Bernoulli variables such that
Note that \(t = c_t n\) for some positive constant \(c_t \in (0,\frac{1}{2})\). We have
and
By Hoeffding’s inequality (Hoeffding 1994), we then obtain
implying that
Similarly, let \(\eta _i\)’s be i.i.d. Bernoulli variables such that
Note that the assumption that \(c_t < \frac{1}{2} - \nu\) gives us
and
Then by Hoeffding’s inequality, we obtain
so that
\(\square\)
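A small simulation illustrating the lemma (with standard Gaussian noise, a fixed seed, and generous quantile margins; illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(6)
n, t = 2500, 500                              # t = c_t * n with c_t = 0.2 < 1/2
clean = np.abs(rng.standard_normal(n - t))    # clean noise magnitudes {|eps_i|}, i in T^c
k = 1100                                      # between (n-t)/2 = 1000 and n/2 = 1250
order_stat = np.sort(clean)[k - 1]

# The k-th order statistic concentrates near the k/(n-t) population quantile
# of |N(0,1)|; compare against generous empirical quantile bounds computed
# from a large independent reference sample.
ref = np.abs(rng.standard_normal(200_000))
lo = np.quantile(ref, 0.5 - 0.06)
hi = np.quantile(ref, k / (n - t) + 0.06)
assert lo <= order_stat <= hi
```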
Lemma 9
Suppose the assumptions of Lemma 7 hold and
and
for some constant \(c_n \in (0,\frac{1}{2})\). Then the \(k\)th order statistic of \(|P_{X_{S}}^\perp (\gamma ^*_S + \epsilon _{S})|\) and the \(k\)th order statistic of \(\left| \left( 1-\frac{p}{|S|}\right) (\gamma ^*_S+\epsilon _{S})\right|\) differ by at most \(\frac{{\bar{c}}}{4}\sigma ^*,\) for any \(k \in [|S|]\), with probability at least \(1-\frac{1}{n-t}\).
Proof
Recall that \(l = n - |S|\). Now consider the sequences \(\{z_i = |e_i^\top P_{X_S}^\perp (\gamma ^*_S+\epsilon _S)|\}_{i=1}^{n-l}\) and \(\left\{ w_i = \left| \left( 1-\frac{p}{n-l}\right) (\gamma ^*_{S,i}+\epsilon _{S,i})\right| \right\} _{i=1}^{n-l}\). By the triangle inequality, we have
for \(i =1,\dots ,n-l\).
Since \(u_i\) is sub-Gaussian with parameter at most \(\left\Vert (P_{X_S}^\perp )_{i\cdot }-e_i^\top \left( 1-\frac{p}{n-l}\right) \right\Vert _2^2\sigma ^{*2}\), we can upper-bound the maximum of \(\{|u_i|\}\). With probability at least \(1- \frac{1}{n-t}\), we have
where the last inequality follows by Lemma 7. Further note that since \(n^{1-2c_n} \ge \frac{32C^2}{1-c_t}\log (2n)\, (p^2+ \log ^2 n)\) for some \(c_n \in (0,\frac{1}{2})\), we have \(\max _{i\in S}{|u_i|} \le \frac{1}{n^{c_n}}\sigma ^*\) .
For the \(v_i\)’s, we have
where (i) holds because \(|a^\top \gamma ^*_S| \le \Vert a\Vert _\infty \Vert \gamma ^*_S\Vert _\infty |{\text {supp}}(\gamma ^*_S)|\) for any vector a, (ii) holds by Lemma 7, and (iii) holds by our assumption on n. Combining this with the assumption that \(\max _{i \in S} |\gamma ^*_S| \le \frac{c_\nu C}{4}\sqrt{1-c_t} \sqrt{ \log 2n}\, \frac{n^{1/2+c_n}}{t} \sigma ^*\), we obtain \(\max _{i\in S}|v_i| \le \frac{c_\nu }{8}\sigma ^*\). Finally, using the fact that \(n \ge \left( \frac{24}{c_\nu }\right) ^{\frac{1}{c_n}}\), we obtain
with probability at least \(1-\frac{1}{n-t}\).
We then use the following lemma:
Lemma 10
For two sequences \(a_1,\dots ,a_n\) and \(b_1, \dots ,b_n\) such that \(|a_i - b_i| \le c\) for some positive number c, the \(j\)th order statistics of \(\{a_i\}\) and \(\{b_i\}\), denoted by \(\alpha _a(j)\) and \(\alpha _b(j)\), satisfy
Proof
Without loss of generality, suppose \(a_1 \le a_2 \le \cdots \le a_n\). If there exists \(j \in [n]\) for which inequality (71) fails, then either \(a_j > c+\alpha _b(j)\) or \(a_j < \alpha _b(j) -c\). Suppose the first case occurs.
Pick a number z between \(c+\alpha _b(j)\) and \(a_j\). Then at least j of the \(b_i\)’s, denoted by \(\mathbf {b}_\downarrow\), are smaller than \(z-c\); and at least \(n-j+1\) of the \(a_i\)’s are greater than z, so at most \(j-1\) of the \(a_i\)’s are no larger than z. However, each component of the subvector \(\mathbf {a}_\downarrow\) corresponding to \(\mathbf {b}_\downarrow\) is within distance c of its counterpart, so all j elements of \(\mathbf {a}_\downarrow\) are at most z. This contradicts the fact that at most \(j-1\) of the \(a_i\)’s are at most z. The second case follows by a symmetric argument. This concludes the proof. \(\square\)
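Lemma 10 is the statement that sorting is 1-Lipschitz in the \(\ell _\infty\) norm; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(7)
n, c = 50, 0.3
a = rng.standard_normal(n)
b = a + rng.uniform(-c, c, size=n)      # pointwise perturbation: |a_i - b_i| <= c

# The j-th order statistics of the two sequences also differ by at most c
assert np.max(np.abs(np.sort(a) - np.sort(b))) <= c + 1e-12
```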
From Lemma 10, we can compare the order statistics of sequences \(\{z_i\}_{i=1}^n\) and \(\{w_i\}_{i=1}^n\) and conclude that they have differences of at most \(\frac{{\bar{c}}}{6}\sigma ^*\), with probability at least \(1-\frac{1}{n-t}\). \(\square\)
Lemma 11
Suppose the conditions of Lemmas 8 and 9 hold, and also \(\min _{i\in T} |\gamma ^*_i| > 4 \sqrt{\log (2n)}\sigma ^*\). Then
with probability at least \(1 - 2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right) - \frac{2}{n-t}\).
Proof
Let \(M_P (S)\) denote the median of \(|P_{X_{S}}^\perp (\gamma ^*_S + \epsilon _{S})|\). By Lemma 9, we know that \(M_P(S)\) is close to the median of \(\left| \left( 1-\frac{p}{|S|}\right) (\gamma ^*_S+\epsilon _S)\right|\). Thus, it remains to analyze the median of \(\{|\gamma ^*_i + \epsilon _i|\}_{i\in S}\).
Note that for \(j \in T^c\), we have \(|\gamma ^*_j + \epsilon _j| = |\epsilon _j|\). Therefore, for all \(j \in S \cap T^c = T^c\), we have \(|\gamma ^*_j + \epsilon _j| \le 2\sqrt{\log 2n}\, \sigma ^*\), with probability at least \(1-\frac{1}{n}\).
For \(i \in T \cap S\), by the assumption that \(\min _{i \in T}|\gamma ^*_i| > 4\sqrt{\log 2n}\, \sigma ^*\), we have \(|\gamma ^*_i + \epsilon _i| \ge |\gamma ^*_i| - |\epsilon _i| > 2\sqrt{\log 2n}\, \sigma ^*\). Therefore, the median of \(|\gamma ^*_S + \epsilon _S|\) is actually the \(k\)th order statistic of \(|\epsilon _{T^c}|\) for some \(k \in \{\frac{n-t}{2}, \dots , \frac{n}{2} \}\). By Lemma 9, we have
In Algorithm 1, at some iteration k, we have \(\widehat{\sigma }= \frac{|S|}{|S|-p} M_P(S)\), where S is the corresponding set of indices of \(\left( {\text {supp}}(\widehat{\gamma }^{(k)})\right) ^c\). Thus,
Combining this with Lemma 8, we have
with probability at least \(1 - 2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right) - \frac{2}{n-t}\). \(\square\)
Lemma 12
Suppose \(n \ge 12p\),
and inequality (70) holds. Then
and for any \(\gamma ^*_S\) such that \(S\cap T \ne \emptyset\), we have
with probability at least \(1 - \frac{3}{n-t} - 2\exp \left( -2\left( \frac{1}{2}-c_t-\nu \right) ^2n\right)\).
Proof
We first establish the bound on \(\Vert P_{X_{T^c}}^\perp \epsilon _{T^c}\Vert _\infty\). Note that \(e_j^\top P_{X_{T^c}}^\perp \epsilon _{T^c}\) is Gaussian with variance at most \(\max \limits _{j\in T^c}(P_{X_{T^c}}^\perp )_{jj}\,\sigma ^{*2}\), so
with probability at least \(1-\frac{1}{n-t}\). In addition, Lemma 11 implies that
For \(n \ge 12 p\), we therefore conclude the bound (72).
Now consider \(\gamma ^*_S\) with nonzero elements, i.e., \(S \supset T^c\). We have
with probability at least \(1 - \frac{1}{n-t}\). We now split \(P_{X_S}^\perp\) into \(P_{X_S}^\perp - (1-\frac{p}{n-l})I\) and \((1-\frac{p}{n-l})I\). By the triangle inequality, we have
Plugging this into the result from inequality (70), we then obtain
Therefore, we have
By the assumption that \(n \ge 12p\) and Lemma 11, we then obtain
Thus, \(\Vert P_{X_S}^\perp (\gamma ^*_S+\epsilon _S)\Vert _\infty \ge \frac{5}{2{\bar{c}}}\sqrt{\log 2n}\, \widehat{\sigma }\) if \(\min \limits _{i \in T}|\gamma ^*_i|\) satisfies
This can be further achieved according to Lemma 11 if
Also note that by the assumption on \(\min _{i\in T}|\gamma ^*_i|\), we have
This concludes the proof. \(\square\)
Lemma 13
(Theorem 2.5 in Adamczak (2015)) Suppose X is a zero-mean random vector in \(\mathbb {R}^n\) satisfying the convex concentration property with constant K. Then for any fixed matrix \(A \in \mathbb {R}^{n \times n}\) and any \(w > 0\), we have
Lemma 14
Suppose \(X \in \mathbb {R}^{n \times p}\) has i.i.d. rows from a zero-mean distribution satisfying the convex concentration property with constant K. Then
with probability at least \(1- \exp (-n)\).
Proof
Note that for any fixed unit vector \(u \in \mathbb {R}^p\), the map \(\varphi : x \mapsto \langle x , \, u \rangle\) is convex and 1-Lipschitz. Hence, by the definition of the convex concentration property, each \(x_i^\top u\) is sub-Gaussian with parameter proportional to K. In fact, this is enough to show the desired matrix concentration result [cf. Vershynin (2010)]. We omit the details. \(\square\)
E Appendix for Sect. 5
In this section, we provide proofs and additional details for the results in Sect. 5.
1.1 E.1 Proof of Theorem 4
We will prove a stronger result here, which implies Theorem 4; this is mentioned in Remark 2.
Theorem 6
With respect to D, a bug generator with attacking budget at most t cannot cause a failure of sign support recovery if and only if (16) holds. Here, a failure of sign support recovery, \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) \ne {{\,\mathrm{sign}\,}}(\gamma ^*)\), means either \(\widehat{\gamma }_j \ne 0\) for some \(j\in T^c\) or \(\widehat{\gamma }_i \gamma ^*_i \le 0\) for some \(i \in T\).
Proof of Theorem 6
We will use the following lemma to prove Theorem 6.
Lemma 15
The following two properties are equivalent:
- (a) For any vector \(\gamma ^*\in \mathbb {R}^d\) with support K, the constraint-based optimization has all solutions \(\widehat{\gamma }\) satisfying \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*)\).
- (b) The matrix \(\bar{P}(D)\) satisfies the restricted nullspace property with respect to K.
Proof of Lemma 15
We first prove \((b) \implies (a)\). This follows immediately from Theorem 7.8 in Wainwright (2019): since (b) implies \(\gamma ^*= \widehat{\gamma }\) for any vector \(\gamma ^*\) with \({\text {supp}}(\gamma ^*) = K\), it implies \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*)\). Alternatively, we can argue directly as follows. Suppose (a) does not hold. Then \(\varDelta := \gamma ^*- \widehat{\gamma }\ne 0\), and by the constraint and the objective, \(\varDelta\) must satisfy \(\varDelta \in Null(\bar{P}(D))\) and
Therefore, we have
which exhibits a nonzero \(\varDelta \in Null(\bar{P}) \cap {\mathbb {C}}^A\) and causes a contradiction. Thus, when (b) is true, (a) holds as well.
From now on until the end of the proof, we abuse notation by writing \(\bar{P}\) for \(\bar{P}(D)\). It remains to prove \((a) \implies (b)\), which we do by contradiction. If (b) does not hold, then there exists a nonzero \(\varDelta\) such that \(\bar{P}\varDelta = 0\) and \(\Vert \varDelta _{K^c}\Vert _1 \le \Vert \varDelta _K\Vert _1\). Consider a \(\gamma ^*\) with \(\gamma ^*_K = \varDelta _K\) and \(\gamma ^*_{K^c} = \mathbf {0}\), and let \(\widehat{\gamma }\) be the optimizer given this \(\gamma ^*\). By (a), we have \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*) = {{\,\mathrm{sign}\,}}\left( \begin{bmatrix}\varDelta _K\\ \mathbf {0}_{(n-t)\times 1}\end{bmatrix}\right)\). The idea is to construct a \(\gamma '\) that has \(\ell _1\) norm no larger than that of \(\widehat{\gamma }\) and support not equal to K, which contradicts (a); therefore, (b) must hold.
Consider \(\gamma ' = \widehat{\gamma }- c \cdot \varDelta\), where \(c = \frac{\widehat{\gamma }_i}{\varDelta _i}\) and \(i = \arg \min _{j\in K} \frac{\widehat{\gamma }_j}{\varDelta _j}\). Since \(\varDelta\) is a nonzero vector, we must have \(\varDelta _l \ne 0\) for some \(l \in K\). Therefore, c is positive and finite, \(\gamma '_i = 0\), and \(|\widehat{\gamma }_j| \ge c |\varDelta _j|\) for all \(j \in K\). We further obtain
as well as
where (i) holds because \({{\,\mathrm{sign}\,}}(\widehat{\gamma }_K) = {{\,\mathrm{sign}\,}}(\varDelta _K)\), \(c > 0\), \(|\widehat{\gamma }_K| \ge c|\varDelta _K|\), and \(\widehat{\gamma }_{K^c} = 0\), and (ii) holds because \(\varDelta \in {\mathbb {C}}(K)\). Hence, we have found a \(\gamma '\) with \(\ell _1\) norm no larger than that of \(\widehat{\gamma }\) and support not equal to K. This contradicts the fact that all solutions have support K (or that \(\widehat{\gamma }\) is the unique optimal solution). Therefore, (b) must hold, and \((a) \implies (b)\). \(\square\)
We first prove that (16) is sufficient. For any \(K \subseteq [n]\) with \(|K| \le t\), we know that \(Null(\bar{P}(D)) \cap {\mathbb {C}}(K) = \{0\}\). Then by Lemma 15, we conclude that \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*)\) for any \(\gamma ^*\) with \({\text {supp}}(\gamma ^*) = K\), for any subset K of size at most t.
Next, we prove that (16) is necessary. Note that for any subset K of size at most t, we have \({{\,\mathrm{sign}\,}}(\widehat{\gamma }) = {{\,\mathrm{sign}\,}}(\gamma ^*)\) with \({\text {supp}}(\gamma ^*) = K\). By Lemma 15, this means \(\bar{P}(D)\) satisfies the restricted nullspace property for any such K. Therefore, \(Null(\bar{P}(D)) \cap {\mathbb {C}}^A = \{\mathbf {0}\}\). \(\square\)
Theorem 4 follows immediately from Theorem 6.
1.2 E.2 Proof of Remark 3
We will prove the statement in Remark 3 here.
Proposition 12
The subspace \(Null(\bar{P}(D))\) coincides with \(\{u \in \mathbb {R}^n \mid \exists v \in \mathbb {R}^p \text{ such } \text{ that } u = Xv, X_Dv = 0\}\).
Proof of Proposition 12
We first prove \(Null(\bar{P}(D)) \supseteq \{u\in \mathbb {R}^n \mid \exists v \in \mathbb {R}^p, \text{ such } \text{ that } u = Xv, X_Dv = 0\}\). Let \(u = \left( X+M^\top X_D\right) v\) for some \(v \in \mathbb {R}^p\), where \(M \in \mathbb {R}^{m \times n}\) contains the m canonical row vectors indexed by D, so that \(MX = X_D\). We have
Besides, we have
Therefore \(X_Dv = 0, u = Xv \implies u \in Null(\bar{P}(D))\).
Second, we prove \(Null(\bar{P}(D)) \subseteq \{u\mid \exists v \in \mathbb {R}^p, \text{ such } \text{ that } u = Xv, X_Dv = 0\}\). Let u be some vector in \(Null(\bar{P}(D))\). Then we have
and
By (75), we have \(\left( X^\top X + X_D^\top X_D \right) ^{-1}X^\top u = v\) for some \(v \in Null(X_D)\). Plugging this back into (74), we obtain \(u = Xv\). Hence, \(u \in \{u \mid \exists v \in \mathbb {R}^p, \text{ such } \text{ that } u = Xv, X_D v = 0\}\). \(\square\)
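Proposition 12 can be sanity-checked on a tiny example. The sketch below uses a hypothetical \(3 \times 2\) design X with \(D = \{0\}\), and assumes, consistently with (74)–(75), that \(\bar{P}(D) = I - X\left( X^\top X + X_D^\top X_D\right) ^{-1}X^\top\); it verifies that \(u = Xv\) with \(X_D v = 0\) indeed lies in \(Null(\bar{P}(D))\).

```python
# Toy check of Proposition 12 (hypothetical 3x2 design, D = {0}), assuming
# P_bar(D) = I - X (X^T X + X_D^T X_D)^{-1} X^T, as suggested by (74)-(75).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_D = [X[0]]                      # D = {0}: the queried row

# v in Null(X_D): X_D v = 0 forces v[0] = 0
v = [[0.0], [1.0]]
u = matmul(X, v)                  # u = X v = (0, 1, 1)^T

# S = X^T X + X_D^T X_D is 2x2, inverted in closed form
A = matmul(transpose(X), X)
B = matmul(transpose(X_D), X_D)
S = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

# P_bar(D) u = u - X S^{-1} X^T u should vanish, i.e. u lies in Null(P_bar(D))
proj_u = matmul(X, matmul(S_inv, matmul(transpose(X), u)))
residual = [u[i][0] - proj_u[i][0] for i in range(3)]
assert all(abs(r) < 1e-9 for r in residual)
```

The intermediate quantity \(S^{-1}X^\top u\) recovers exactly the v used to build u, mirroring the argument following (75).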
E.3 Proof of Theorem 5
Here we give the proof of Theorem 5. We restate the minimax MILP here.
Proof of Theorem 5
We first argue that if (83) has the unique solution \((u,v) = (\mathbf {0},\mathbf {0})\), then (16) holds, and thus the debugger can add the m points indexed by D to achieve support recovery.
Suppose (16) does not hold. Then there exist \(K\subseteq [n]\) with \(|K| \le t\) and a nonzero vector \(u'\) such that \(u'=Xv\), \(X_Dv= 0\), and \(\Vert u'_K\Vert _1 \ge \Vert u'_{K^c}\Vert _1\). Moreover, the rescaled vector \(u'' = \frac{u'}{\Vert u'\Vert _2}\) satisfies \(\Vert u''\Vert _\infty \le 1\) and the same constraints, so it is a nonzero feasible point of (83). This contradicts the assumption that (83) has the unique solution \((u,v) = (\mathbf {0},\mathbf {0})\); hence (16) holds. This concludes the first part of the proof.
Now we argue that the MILP is equivalent to (83). Equation (77) is inherited from the original constraint. Equations in (78) are equivalent to \(a = |u|\) elementwise. Note that \(u^+, u^-\) respectively correspond to the positive and negative parts of u. If \(z_i = 0\), then \(u_i^+ = 0\), \(u^-_i \le 1\), and \(u^-_i = -u_i\). If \(z_i = 1\), then \(u^-_i = 0\), \(u^+_i \le 1\), and \(u^+_i = u_i\). The vector w indicates K in (83): if \(w_i=1\), then \(i \in K\); otherwise \(i \in K^c\). Therefore, Eq. (79) restricts the attacking budget to t. The equations in (80) are then equivalent to \(a_i^+ = |u_i|, a_i^- = 0\) for \(i \in K\) and \(a_i^- = |u_i|, a_i^+ = 0\) for \(i \in K^c\). Therefore, the objective function corresponds to \(\Vert u_K\Vert _1 - \Vert u_{K^c}\Vert _1\).
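The absolute-value linearization described above can be sketched directly. The helper below (a hypothetical name, not from the paper) picks the binary \(z_i\) and the split \((u_i^+, u_i^-)\) for a given \(u_i\) with \(|u_i| \le 1\), and checks the constraints mirroring (78):

```python
# Sketch of the absolute-value linearization in (78); `split` is a hypothetical
# helper illustrating how z_i, u_i^+, u_i^- encode |u_i| for |u_i| <= 1:
#   0 <= u_i^+ <= z_i,  0 <= u_i^- <= 1 - z_i,  u_i = u_i^+ - u_i^-.

def split(u_i):
    """Return (z, u_plus, u_minus) satisfying the linearized constraints."""
    z = 1 if u_i >= 0 else 0
    u_plus = u_i if z == 1 else 0.0
    u_minus = -u_i if z == 0 else 0.0
    # feasibility checks mirroring (78)
    assert 0.0 <= u_plus <= z and 0.0 <= u_minus <= 1 - z
    assert abs((u_plus - u_minus) - u_i) < 1e-12
    return z, u_plus, u_minus

for u_i in (-0.7, 0.0, 0.3, 1.0):
    z, up, um = split(u_i)
    assert abs((up + um) - abs(u_i)) < 1e-12   # u_i^+ + u_i^- equals |u_i|
```

Since at most one of \(u_i^+, u_i^-\) can be nonzero (the binary \(z_i\) forces the other to zero), the sum \(u_i^+ + u_i^-\) recovers \(|u_i|\) with purely linear constraints, which is what makes the minimax problem expressible as a MILP.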
Note that the variable in the first layer is \(\xi\): if \(\xi _i = 1\), the debugger queries the point \(x_i\). The constraint \(X_D v = 0\) is replaced by (82), since \(x_i^\top v = 0 \Leftrightarrow u_i = 0\). If \(\xi _j = 0\), then \(u_j\) only needs to satisfy \(|u_j| \le 1\).
Therefore, we have shown that the MILP is equivalent to (83), which concludes the proof of Theorem 5. \(\square\)
Zhang, X., Zhu, X. & Loh, PL. Provable training set debugging for linear regression. Mach Learn 110, 2763–2834 (2021). https://doi.org/10.1007/s10994-021-06040-4