Article

Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification

Konrad Furmańczyk 1,† and Wojciech Rejchel 2,†
1 Institute of Information Technology, Warsaw University of Life Sciences (SGGW), Nowoursynowska 159, 02-776 Warszawa, Poland
2 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2020, 22(5), 543; https://doi.org/10.3390/e22050543
Submission received: 20 April 2020 / Revised: 8 May 2020 / Accepted: 11 May 2020 / Published: 13 May 2020

Abstract
In this paper, we consider prediction and variable selection in misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification which are computationally efficient but lead to model misspecification. The first is to apply penalized logistic regression to classification data that possibly do not follow the logistic model. The second method is even more radical: we simply treat the class labels of objects as if they were numbers and apply penalized linear regression. In this paper, we thoroughly investigate these two approaches and provide conditions which guarantee that they are successful in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper is completed by experimental results.

1. Introduction

Large-scale data sets, where the number of predictors significantly exceeds the number of observations, have become common in many practical problems arising in, among others, biology and genetics. Currently, the analysis of such data sets is a fundamental challenge in statistics and machine learning. High-dimensional prediction and variable selection are arguably the most popular and intensively studied topics in this field. There are many methods aimed at solving these problems, such as those based on penalized estimation [1,2]. Their main representative is the Lasso [3], which is based on $l_1$-norm penalization. Its properties in model selection, estimation and prediction have been thoroughly investigated, among others, in [2,4,5,6,7,8,9,10]. The results obtained in the above papers can be applied only if some specific assumptions are satisfied; for instance, these conditions concern the relation between the response variable and the predictors. However, it is quite common that a complex data set does not satisfy these model assumptions, or that they are difficult to verify, so the considered model is specified incorrectly. The model misspecification problem is the core of the current paper. We investigate this topic in the context of high-dimensional binary classification (binary regression).
In the classification problem we want to predict (guess) the class label of an object on the basis of its observed predictors. The object is described by the random vector $(X, Y)$, where $X \in \mathbb{R}^p$ is a vector of predictors and $Y \in \{-1, 1\}$ is the class label of the object. A classifier is defined as a measurable function $f: \mathbb{R}^p \to \mathbb{R}$, which determines the label of an object in the following way: if $f(x) \geq 0$, then we predict that $y = 1$; otherwise, we guess that $y = -1$.
The most natural approach is to look for a classifier $f$ which minimizes the misclassification risk (the probability of incorrect classification)
$$R(f) = P(Y = 1, f(X) < 0) + P(Y = -1, f(X) \geq 0). \tag{1}$$
Let $\eta(x) = P(Y = 1 \mid X = x)$. It is clear that $f_B(x) = \mathrm{sign}(2\eta(x) - 1)$ minimizes the risk (1) in the family of all classifiers. It is called the Bayes classifier and we denote its risk by $R_B = R(f_B)$. Obviously, in practice we do not know the function $\eta$, so we cannot find the Bayes classifier. However, if we possess a training sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ containing independent copies of $(X, Y)$, then we can consider a sample analog of (1), namely the empirical misclassification risk
$$\frac{1}{n}\sum_{i=1}^n \big[ I(Y_i = 1, f(X_i) < 0) + I(Y_i = -1, f(X_i) \geq 0) \big], \tag{2}$$
where I is the indicator function. Then a minimizer of (2) could be used as our estimator.
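To make this concrete, the short R sketch below (our own illustration; the function name and inputs are assumptions, not taken from the paper) computes the empirical misclassification risk (2) of an arbitrary classifier on a sample with labels in {-1, 1}.

```r
# Minimal illustration (assumed names): empirical misclassification risk (2)
# of a classifier f on a sample (X, y) with labels y in {-1, 1}.
empirical_risk <- function(f, X, y) {
  fx <- apply(X, 1, f)                          # classifier values f(X_i)
  mean((y == 1 & fx < 0) | (y == -1 & fx >= 0))
}
# example: a fixed linear classifier f(x) = 0.5 + x[1] - x[2] on toy data
set.seed(1)
X <- matrix(rnorm(20), 10, 2)
y <- sample(c(-1, 1), 10, replace = TRUE)
empirical_risk(function(x) 0.5 + x[1] - x[2], X, y)
```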
The main difficulty in this approach lies in the discontinuity of the function (2), which makes finding its minimizer computationally difficult and inefficient. To overcome this problem, one usually replaces the discontinuous loss function by a convex analog $\phi: \mathbb{R} \to [0, \infty)$, for instance the logistic loss, the hinge loss or the exponential loss. Then we obtain the convex empirical risk
$$\bar{Q}(f) = \frac{1}{n}\sum_{i=1}^n \phi\big(Y_i f(X_i)\big). \tag{3}$$
In the high-dimensional case one usually obtains an estimator by minimizing a penalized version of (3). These ideas have been used with great success in classification theory and have led to boosting algorithms [11], support vector machines [12] and Lasso estimators [3]. In this paper we are mainly interested in Lasso estimators, because they are able to solve the variable selection and prediction problems simultaneously, while the first two algorithms were developed mainly for prediction.
Thus, we consider linear classifiers
$$f_b(x) = b_0 + \sum_{j=1}^p b_j x_j, \tag{4}$$
where $b = (b_0, b_1, \ldots, b_p)' \in \mathbb{R}^{p+1}$. For a fixed loss function $\phi$ we define the Lasso estimator as
$$\hat{b} = \arg\min_{b \in \mathbb{R}^{p+1}} \Big[ \bar{Q}(f_b) + \lambda \sum_{j=1}^p |b_j| \Big], \tag{5}$$
where $\lambda$ is a positive tuning parameter, which provides a balance between minimizing the empirical risk and the penalty. The form of the penalty is crucial: its singularity at the origin implies that some coordinates of the minimizer $\hat{b}$ are exactly equal to zero if $\lambda$ is sufficiently large. Thus, by computing (5) we simultaneously select significant predictors in the model and estimate their coefficients, so we are also able to predict the class of new objects. The function $\bar{Q}(f_b)$ and the penalty are convex, so (5) is a convex minimization problem, which is important from both the practical and the theoretical point of view. Notice that the intercept $b_0$ is not penalized in (5).
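As an aside, both estimators in (5) can be computed with standard software. The R sketch below (our illustration, not code from the paper) uses the glmnet package that is employed in Section 6; for labels $y \in \{-1,1\}$ the quadratic loss satisfies $(1 - yf)^2 = (y - f)^2$, so the quadratic-loss Lasso is simply penalized least squares with the labels treated as numbers. Note that glmnet rescales its objective internally, so its lambda matches the $\lambda$ in (5) only up to a constant factor, and it leaves the intercept unpenalized by default.

```r
# Illustrative sketch: the two Lasso classifiers (5) computed with glmnet.
library(glmnet)
set.seed(1)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- ifelse(runif(n) < plogis(X[, 1] - X[, 2]), 1, -1)   # toy labels in {-1, 1}
lam <- 0.05                                   # example value of the tuning parameter
# quadratic loss: penalized least squares with labels treated as numbers
fit_quad <- glmnet(X, y, family = "gaussian", lambda = lam)
# logistic loss: penalized logistic regression on the 0/1 recoding of y
fit_log  <- glmnet(X, (y + 1) / 2, family = "binomial", lambda = lam)
head(coef(fit_quad))                          # sparse vector (b_0, b_1, ..., b_p)
```

In the notation of Section 2, the first fit corresponds to $\hat{b}^{quad}$ and the second to $\hat{b}^{log}$.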
The random vector (5) is an estimator of
$$b^* = \arg\min_{b \in \mathbb{R}^{p+1}} Q(f_b), \tag{6}$$
where $Q(f_b) = E\,\phi\big(Y f_b(X)\big)$. In this paper we are mainly interested in the minimizers (6) corresponding to the quadratic and logistic loss functions. The latter has a nice information-theoretic interpretation: it can be viewed as the Kullback–Leibler projection of the unknown $\eta$ on logistic models [13]. The Kullback–Leibler divergence [14] plays an important role in information theory and statistics; for instance, it is involved in information criteria for model selection [15] and in detecting influential observations [16].
In general, the classifier corresponding to (6) need not coincide with the Bayes classifier. Obviously, we want to have a “good” estimator, which means that its misclassification risk should be as close to the risk of the Bayes classifier as possible. In other words, its excess risk
$$\mathcal{E}(\hat{b}, f_B) = E_D R(\hat{b}) - R_B \tag{7}$$
should be small, where $E_D$ is the expectation with respect to the data $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ and we write simply $R(b)$ instead of $R(f_b)$. Our goal is to study the excess risk (7) for the estimator (5) with different loss functions $\phi$. We do this by looking for upper bounds on (7).
In the excess risk (7) we compare two misclassification risks defined in (1). In the literature one can also find a different approach, in which the misclassification risks $R(\cdot)$ in (7) are replaced by the convex risks $Q(\cdot)$. In that case the excess risk depends on the loss function $\phi$. To deal with this fact one uses the results from [17,18], which relate the excess risk (7) to its analog based on the convex risk $Q(\cdot)$. In this paper we do not follow this path and work, right from the beginning, with the excess risk that is independent of $\phi$; only the estimator (5) depends on the loss $\phi$.
In this paper we are also interested in variable selection. We investigate this problem in the following semiparametric model
$$\eta(x) = g\Big(\beta_0 + \sum_{j=1}^p \beta_j x_j\Big), \tag{8}$$
where $\eta(x) = P(Y = 1 \mid X = x)$, $\beta \in \mathbb{R}^{p+1}$ is the true parameter and $g$ is an unknown function. Thus, we suppose that the predictors influence the class probability through the function $g$ of the linear combination $\beta_0 + \sum_{j=1}^p \beta_j x_j$. The goal of variable selection is the identification of the set of significant predictors
$$T = \{1 \leq j \leq p : \beta_j \neq 0\}. \tag{9}$$
Obviously, in the model (8) we cannot estimate the intercept $\beta_0$ and we can identify the vector $(\beta_1, \ldots, \beta_p)$ only up to a multiplicative constant, because any shift or scale change in $\beta_0 + \sum_{j=1}^p \beta_j X_j$ can be absorbed by $g$. However, we show in Section 5 that in many situations the Lasso estimator (5) can properly identify the set (9).
The literature on the classification problem is comprehensive; we mention just a few references: [12,19,20,21]. The predictive quality of classifiers is often investigated by obtaining upper bounds for their excess risks. This is an important problem and was studied thoroughly, among others, in [17,18,22,23,24]. The variable selection and predictive properties of estimators in the high-dimensional scenario were studied, for instance, in [2,10,13,25,26]. In the current paper we investigate the behaviour of classifiers in possibly misspecified high-dimensional classification, which appears frequently in practice. For instance, while working with binary regression one often assumes, incorrectly, that the data follow the logistic regression model, and the problem is then solved using the Lasso-penalized maximum likelihood method. Another approach to binary regression, which is widely used due to its computational simplicity, is to treat the labels $Y_i$ as if they were numbers and apply the standard Lasso. For instance, such a method is used in ([1], Subsections 4.2 and 4.3) or ([2], Subsection 2.4.1). These two approaches to classification sometimes give unexpectedly good results in variable selection and prediction, but the reason for this phenomenon has not been deeply studied in the literature. Among the above-mentioned papers only [2,13,25] take up this issue. However, [25] focuses mainly on the predictive properties of Lasso classifiers with the hinge loss. Bühlmann and van de Geer [2] and Kubkowski and Mielniczuk [13] study general Lipschitz loss functions. The latter paper considers only the variable selection problem. Prediction is also investigated in [2], but classification with the quadratic loss is not studied there.
In this paper we are interested in both the variable selection and predictive properties of classifiers with convex (but not necessarily Lipschitz) loss functions. The prominent example is classification with the quadratic loss function, which has not been investigated so far in the context of the high-dimensional misspecified model. In this case the estimator (5) can be calculated efficiently using existing algorithms, for instance [27] or [28], even if the number of predictors is much larger than the sample size. This makes the estimator very attractive when working with large data sets. Reference [28] also provides an efficient algorithm for Lasso estimators with the logistic loss in the high-dimensional scenario; therefore, misspecified classification with the logistic loss plays an important role in this paper as well. Our goal is to study such estimators thoroughly and to provide conditions which guarantee that they are successful in prediction and variable selection.
The paper is organized as follows: in the next section we provide basic notations and assumptions, which are used in this paper. In Section 3 we study predictive properties of Lasso estimators with different loss functions. We will see that these properties depend strongly on the estimation quality of estimators, which is studied in Section 4. In Section 5 we consider variable selection. In Section 6 we show numerical experiments, which describe the quality of estimators in practice. The proofs and auxiliary results are relegated to Appendix A.

2. Assumptions and Notation

In this paper we work in the high-dimensional scenario $p \gg n$. As usual, we assume that the number of predictors $p$ can vary with the sample size $n$, which could be denoted as $p(n) = p_n$. However, to keep the notation simple we omit the subscript and write $p$ instead of $p_n$. The same applies to the other objects appearing in this paper.
In the further sections we will need the following notation:
- $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})'$;
- $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ is the $(n \times p)$-matrix of predictors;
- let $A \subset \{1, \ldots, p\}$; then $A^c = \{1, \ldots, p\} \setminus A$ is the complement of $A$;
- $\mathbf{X}_A$ is the submatrix of $\mathbf{X}$ with columns whose indices belong to $A$;
- $b_A$ is the restriction of a vector $b \in \mathbb{R}^p$ to the indices from $A$;
- $|A|$ is the number of elements of $A$;
- $\tilde{A} = A \cup \{0\}$, so the set $\tilde{A}$ contains the indices from $A$ and the intercept;
- the $l_q$-norm of a vector is defined as $|b|_q = \big(\sum_{j=1}^p |b_j|^q\big)^{1/q}$ for $q \in [1, \infty]$;
- for $x \in \mathbb{R}^p$ we denote $\tilde{x} = (1, x')'$;
- $\tilde{\mathbf{X}}$ is the matrix $\mathbf{X}$ with a column of ones appended on the left;
- $\hat{b}^{quad}$, $b^{*quad}$ are the minimizers in (5), (6), respectively, with the quadratic loss function;
- $\hat{b}^{log}$, $b^{*log}$ are the minimizers in (5), (6), respectively, with the logistic loss function;
- the Kullback–Leibler (KL) distance [14] between two binary distributions with success probabilities $\pi_1$ and $\pi_2$ is defined as
$$KL(\pi_1, \pi_2) = \pi_1 \log\frac{\pi_1}{\pi_2} + (1 - \pi_1)\log\frac{1 - \pi_1}{1 - \pi_2}. \tag{10}$$
Obviously, we have $KL(\pi_1, \pi_2) \geq 0$ and $KL(\pi_1, \pi_2) = 0$ if and only if $\pi_1 = \pi_2$. Moreover, the KL distance need not be symmetric (a small numerical illustration is given right after this list);
- the set of nonzero coefficients of $b^{*quad}$ is denoted by
$$T = \{1 \leq j \leq p : (b^{*quad})_j \neq 0\}. \tag{11}$$
Notice that the intercept is not contained in (11) even if it is nonzero.
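The following small R snippet (ours; the function name kl_bernoulli is made up) evaluates the KL distance (10) and illustrates the lack of symmetry mentioned in the list above.

```r
# Kullback-Leibler distance (10) between Bernoulli(pi1) and Bernoulli(pi2).
kl_bernoulli <- function(pi1, pi2) {
  pi1 * log(pi1 / pi2) + (1 - pi1) * log((1 - pi1) / (1 - pi2))
}
kl_bernoulli(0.3, 0.5)   # approx. 0.082
kl_bernoulli(0.5, 0.3)   # approx. 0.087 -- different value, so KL is not symmetric
```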
We also specify assumptions, which are used in this paper.
Assumption 1.
We assume that $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. random vectors. Moreover, the predictors are univariate subgaussian, i.e., for each $a \in \mathbb{R}$ and $j \in \{1, \ldots, p\}$ we have $E\exp(aX_j) \leq \exp(\sigma_j^2 a^2/2)$ for positive numbers $\sigma_j$. We also denote $\sigma = \max_{1 \leq j \leq p}\sigma_j$. Finally, we suppose that the matrix $H = E[XX']$ is positive definite and $H_{jj} = 1$ for $j = 1, \ldots, p$.
In Section 4 and Section 5 we need a stronger version of Assumption 1.
Assumption 2.
We suppose that the subvector of predictors $X_T$ is subgaussian with coefficient $\sigma_0 > 0$, i.e., for each $u \in \mathbb{R}^{|T|}$ we have $E\exp(u'X_T) \leq \exp(\sigma_0^2\, u' H_T u / 2)$, where $H_T = [E X_{1j} X_{1k}]_{j,k \in T}$. The remaining conditions are as in Assumption 1. We also denote $\sigma = \max(\sigma_0, \sigma_j,\ j \notin T)$.
Subgaussianity of the predictors is a standard assumption when working with random predictors in high-dimensional models, cf. [13]. In particular, Assumption 1 implies that $E[X] = 0$ and $\sigma \geq 1$ [29].

3. Predictive Properties of Classifiers

In this part of the paper we study prediction properties of classifiers with convex loss functions. To do this we look for upper bounds on the excess risk (7) of estimators.
As usual, the excess risk in (7) can be decomposed as
$$E_D R(\hat{b}) - R(b^*) + R(b^*) - R_B. \tag{12}$$
The second term in (12) is the approximation risk and compares the predictive ability of the "best" linear classifier (6) with the Bayes classifier. The first term in (12) is called the estimation risk and describes how the estimation process influences the predictive properties of classifiers.
In the next theorem we bound the estimation risk of classifiers from above. To make the result more transparent we use the notations $P_D$ and $P_X$ in (13), which indicate explicitly which probability we consider, i.e., $P_D$ is the probability with respect to the data $D$ and $P_X$ is with respect to the new object $X$. In further results we omit these subscripts and trust that this does not lead to confusion.
Theorem 1.
For $c > 0$ we consider the event $\Omega = \{|\hat{b} - b^*|_1 \leq c\}$. We have
$$E_D R(\hat{b}) - R(b^*) \leq 2 P_D(\Omega^c) + P_X\big(|(b^*)'\tilde{X}| \leq c\,|\tilde{X}|_\infty\big). \tag{13}$$
In Theorem 1 we obtain an upper bound for the estimation risk. This risk becomes small if we establish that the probability of the event $\Omega^c$ is small and the sequence $c$, which is involved in $\Omega$ and in the second term on the right-hand side of (13), decreases sufficiently fast to zero. Therefore, Theorem 1 shows that to have a small estimation risk it is enough to prove that for each $\varepsilon \in (0,1)$ there exists $c$ such that
$$P(|\hat{b} - b^*|_1 \leq c) \geq 1 - \varepsilon. \tag{14}$$
Moreover, the numbers $\varepsilon$ and $c$ should be sufficiently small. This property will be studied thoroughly in the next section. Notice that the first term on the right-hand side of (13) relates to how well (5) estimates (6). Moreover, the second expression on the right-hand side of (13) can be bounded from above if the predictors are sufficiently regular, for instance subgaussian.
So far, we have been interested in the estimation risk of estimators. In the next result we establish an upper bound for the approximation risk as well. This bound, combined with (13), enables us to bound the excess risk of estimators from above. We prove this fact for the quadratic loss $\phi(t) = (1 - t)^2$ and the logistic loss $\phi(t) = \log(1 + e^{-t})$, which play prominent roles in this paper.
Theorem 2.
Suppose that Assumption 1 is fulfilled. Moreover, suppose that the random variable $(b^*)'\tilde{X}$ has a density $h$, which is continuous on the interval $U = [-2\sigma c\sqrt{\log p},\, 2\sigma c\sqrt{\log p}]$, and let $\tilde{h} = \sup_{u \in U} h(u)$.
(a) We have
$$\mathcal{E}(\hat{b}^{quad}, f_B) \leq 2P(\Omega^c) + 4\sigma \tilde{h}^{quad} c\sqrt{\log p} + 2/p \tag{15}$$
$$\qquad + \sqrt{E\big[2\eta(X) - 1 - (b^{*quad})'\tilde{X}\big]^2}, \tag{16}$$
where $\tilde{h}^{quad}$ refers to the density $h$ of $(b^{*quad})'\tilde{X}$.
(b) Let $\eta^{log}(u) = 1/(1 + \exp(-u))$. Then we obtain
$$\mathcal{E}(\hat{b}^{log}, f_B) \leq 2P(\Omega^c) + 4\sigma \tilde{h}^{log} c\sqrt{\log p} + 2/p \tag{17}$$
$$\qquad + \sqrt{2\, E\, KL\big(\eta(X), \eta^{log}((b^{*log})'\tilde{X})\big)}, \tag{18}$$
where $KL(\cdot,\cdot)$ is the Kullback–Leibler distance defined in (10) and $\tilde{h}^{log}$ refers to the density $h$ of $(b^{*log})'\tilde{X}$. Additionally, assuming that there exists $\delta \in (0,1)$ such that $\delta \leq \eta(X) \leq 1 - \delta$ and $\delta \leq \eta^{log}((b^{*log})'\tilde{X}) \leq 1 - \delta$, we have
$$E\, KL\big(\eta(X), \eta^{log}((b^{*log})'\tilde{X})\big) \leq \big(2\delta(1 - \delta)\big)^{-1}\, E\big[\eta(X) - \eta^{log}((b^{*log})'\tilde{X})\big]^2. \tag{19}$$
In Theorem 2 we establish upper bounds on the excess risks of the Lasso estimators (5). They describe the predictive properties of these classifiers. In this paper we consider linear classifiers, so the misclassification risk of an estimator is close to the Bayes risk if the "truth" can be approximated linearly in a satisfactory way. For the classifier with the logistic loss this fact is described by (18) and (19), which measure the distance between the true success probability and the one in the logistic model. In particular, when the true model is logistic, then (18) and (19) vanish. The expression (16) relates to the approximation error in the case of the quadratic loss. It measures how well the conditional expectation $E[Y|X]$ can be described by the "best" (with respect to the loss $\phi$) linear function $(b^{*quad})'\tilde{X}$.
The right-hand sides of (15) and (17) relate to the estimation risk. They have already been discussed after Theorem 1. Using subgaussianity of the predictors we have made them more explicit. The main ingredient of the bounds in Theorem 2, namely $P(\Omega^c)$, is studied in the next section.
The results in Theorem 2 refer to Lasso estimators with the quadratic and logistic loss functions. Similar results are given in ([2], Theorem 6.4). They refer to the case where the convex excess risk is considered, i.e., the misclassification risks $R(\cdot)$ are replaced by the convex risks $Q(\cdot)$ in (7). Moreover, those results do not consider Lasso estimators with the quadratic loss applied to classification, which is an approach playing a key role in the current paper. Furthermore, in ([2], Theorem 6.4) the estimation error $\hat{b} - b^*$ is measured in the $l_1$-norm, which is enough for prediction. However, for variable selection the $l_\infty$-norm gives better results; such results will be established in Section 4 and Section 5. Finally, the results of [2] need more restrictive assumptions than ours; for instance, the predictors should be bounded and the function $f_{b^*}$ should be sufficiently close to $f_B$ in the supremum norm.
Analogous bounds to those in Theorem 2 can be obtained for other loss functions if we combine Theorem 1 with the results of [17]. Finally, we should stress that the estimator $\hat{b}$ need not rely on the Lasso method; all we require is that the bound (14) can be established for this estimator.

4. On the Event Ω

In this section we show that the probability of the event $\Omega$ can be close to one. Such results for classification models with Lipschitz loss functions were established in [2,13]. Therefore, we focus on the quadratic loss function, which is obviously non-Lipschitz. This loss function is important from the practical point of view, but was not considered in those papers. Moreover, in our results the estimation error in $\Omega$ can be measured in the $l_q$-norms, $q \geq 1$, not only in the $l_1$-norm as in [2,13]. Bounds in the $l_\infty$-norm lead to better results in variable selection, which are given in Section 5.
We start by introducing the cone invertibility factor (CIF), which plays a significant role in investigating properties of estimators based on the Lasso penalty [9]. In the case $n > p$ one usually uses the minimal eigenvalue of the matrix $\mathbf{X}'\mathbf{X}/n$ to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is equal to zero, and the minimal eigenvalue needs to be replaced by some other measure of predictor interdependence which describes the potential for consistent estimation of the model parameters.
For $\xi > 1$ we define the cone
$$C(\xi) = \{b \in \mathbb{R}^{p+1} : |b_{T^c}|_1 \leq \xi |b_{\tilde{T}}|_1\},$$
where we recall that $\tilde{T} = T \cup \{0\}$. In the case $p \gg n$ three different characteristics measuring the potential for consistent estimation of the model parameters have been introduced:
- the restricted eigenvalue [8]:
$$RE(\xi) = \inf_{0 \neq b \in C(\xi)} \frac{b'\tilde{\mathbf{X}}'\tilde{\mathbf{X}}b/n}{|b|_2^2},$$
- the compatibility factor [7]:
$$K(\xi) = \inf_{0 \neq b \in C(\xi)} \frac{|T|\, b'\tilde{\mathbf{X}}'\tilde{\mathbf{X}}b/n}{|b_T|_1^2},$$
- the cone invertibility factor (CIF, [9]): for $q \geq 1$
$$\bar{F}_q(\xi) = \inf_{0 \neq b \in C(\xi)} \frac{|T|^{1/q}\, |\tilde{\mathbf{X}}'\tilde{\mathbf{X}}b/n|_\infty}{|b|_q}.$$
In this article we use the CIF, because this factor allows for a sharp formulation of convergence results for all $l_q$-norms with $q \geq 1$, see ([9], Section 3.2). The population (non-random) version of the CIF is given by
$$F_q(\xi) = \inf_{0 \neq b \in C(\xi)} \frac{|T|^{1/q}\, |\tilde{H} b|_\infty}{|b|_q},$$
where $\tilde{H} = E\,\tilde{X}\tilde{X}'$. The key property of the random and population versions of the CIF, $\bar{F}_q(\xi)$ and $F_q(\xi)$, is that, in contrast to the smallest eigenvalues of the matrices $\tilde{\mathbf{X}}'\tilde{\mathbf{X}}/n$ and $\tilde{H}$, they can be close to each other in the high-dimensional setting, see ([30], Lemma 4.1) or ([31], Corollary 10.1). This fact is used in the proof of Theorem 3 (given below).
Next, we state the main results of this section.
Theorem 3.
Let $a \in (0,1)$, $q \geq 1$ and $\xi > 1$ be arbitrary. Suppose that Assumption 2 is satisfied,
$$n \geq K_1 \frac{|T|^2 \sigma^4 (1 + \xi)^2 \log(p/a)}{F_q^2(\xi)} \tag{20}$$
and
$$\lambda \geq K_2\, \frac{\xi + 1}{\xi - 1}\, \sigma^2 \sqrt{\frac{\log(p/a)}{n}}, \tag{21}$$
where $K_1, K_2$ are universal constants. Then there exists a universal constant $K_3 > 0$ such that with probability at least $1 - K_3 a$ we have
$$|\hat{b}^{quad} - b^{*quad}|_q \leq \frac{2\xi |T|^{1/q} \lambda}{(\xi + 1) F_q(\xi)}. \tag{22}$$
In Theorem 3 we provide an upper bound for the estimation error of the Lasso estimator with the quadratic loss function. This result gives conditions for estimation consistency of $\hat{b}^{quad}$ in the high-dimensional scenario, i.e., the number of predictors can be significantly greater than the sample size. Indeed, consistency in the $l_\infty$-norm holds, e.g., when $p = \exp(n^{a_1})$, $|T| = n^{a_2}$ and $a = \exp(-n^{a_1})$ with $a_1 + 2a_2 < 1$, provided that $\lambda$ is taken as the right-hand side of the inequality (21), $F_\infty(\xi)$ is bounded from below (or slowly converging to 0) and $\sigma$ is bounded from above (or slowly diverging to $\infty$).
The choice of the parameter $\lambda$ is difficult in practice, which is a common drawback of Lasso estimators. However, Theorem 3 gives us a hint on how to choose $\lambda$: the "safe" choice is the right-hand side of the inequality (21), so, roughly speaking, $\lambda$ should be proportional to $\sqrt{\log(p)/n}$. In the experimental part of the paper the parameter $\lambda$ is chosen using cross-validation. As we will observe, this gives satisfactory results for the Lasso estimators in both prediction and variable selection.
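For illustration, the sketch below (ours; it reuses the toy data X, y from the sketch in Section 1) contrasts the theory-driven choice of $\lambda$ with the cross-validated one used in Section 6; the proportionality constant in the theoretical choice is unknown in practice and is set to 1 here.

```r
# Two ways of choosing lambda for the quadratic-loss Lasso (illustration only).
library(glmnet)
n <- nrow(X); p <- ncol(X)
lambda_theory <- sqrt(log(p) / n)            # "safe" rate from (21), constant set to 1
fit_theory <- glmnet(X, y, family = "gaussian", lambda = lambda_theory)
cv_fit <- cv.glmnet(X, y, family = "gaussian", nfolds = 10)
lambda_cv <- cv_fit$lambda.min               # data-driven choice used in the experiments
c(theory = lambda_theory, cv = lambda_cv)
```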
Theorem 3 is a crucial fact, which gives the upper bound for (15) in Theorem 2. Namely, taking q = 1 , a = 1 / p and λ equal to the right-hand side of the inequality (21), we obtain the following consequence of Theorem 3.
Corollary 1.
Suppose that Assumption 2 is satisfied. Moreover, assume that there exist $\xi_0 > 1$ and constants $C_1 > 0$ and $C_2 < \infty$ such that $F_1(\xi_0) \geq C_1$ and $\sigma \leq C_2$. If $n \geq K_1 |T|^2 \log p$, then
$$P\left( |\hat{b}^{quad} - b^{*quad}|_1 \leq K_2 |T| \sqrt{\frac{\log p}{n}} \right) \geq 1 - K_3/p, \tag{23}$$
where the constants $K_1$ and $K_2$ depend only on $\xi_0$, $C_1$, $C_2$, and $K_3$ is the universal constant provided in Theorem 3.
The above result works for Lasso estimators with the quadratic loss. In the case of the logistic loss an analogous result is obtained in ([13], Theorem 1). In fact, their results relate to quite general Lipschitz loss functions, which can be useful in extending Theorem 2 to such cases.

5. Variable Selection Properties of Estimators

In Section 3 we were interested in the predictive properties of estimators. In this part of the paper we focus on variable selection, which is another important problem in high-dimensional statistics. As we have already noticed, upper bounds for the probability of the event $\Omega$ are crucial in proving results concerning prediction. They also play a key role in establishing results relating to variable selection. In this section we again focus on the Lasso estimator with the quadratic loss function. Analogous results for Lipschitz loss functions were considered in ([13], Corollary 1).
In the variable selection problem we want to find the significant predictors, which, roughly speaking, give us some information on the observed phenomenon. We consider this problem in the semiparametric model defined in (8). In this case the set of significant predictors is given by (9). As we have already mentioned, the vectors $\beta$ and $b^{*quad}$ need not be the same. However, it was proved in [32] that for a real number $\gamma$ the relation
$$(b^{*quad})_j = \gamma\, \beta_j, \quad j = 1, \ldots, p \tag{24}$$
holds under Assumption 3, which is stated below.
Assumption 3.
Let $\mathring{\beta} = (\beta_1, \ldots, \beta_p)'$. We assume that for each $\theta \in \mathbb{R}^p$ the conditional expectation $E[\theta' X \mid \mathring{\beta}' X]$ exists and
$$E[\theta' X \mid \mathring{\beta}' X] = d_\theta\, \mathring{\beta}' X$$
for a real number $d_\theta \in \mathbb{R}$.
The coefficient $\gamma$ in (24) can be easily calculated. Namely, we have
$$\gamma = \frac{E[Y \mathring{\beta}' X]}{\mathring{\beta}' H \mathring{\beta}} = \frac{2\, E[g(\beta'\tilde{X})\, \mathring{\beta}' X]}{\mathring{\beta}' H \mathring{\beta}}.$$
Standard arguments [33] show that $\gamma$ is nonzero if $g$ is monotonic. In this case the set $T$ defined in (9) equals the set $T$ defined in (11).
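A quick Monte Carlo check of the proportionality (24) can be run in R; multivariate normal predictors satisfy Assumption 3, and a large-sample least squares fit of the labels on the predictors approximates $b^{*quad}$. The parameter values below are arbitrary and chosen only for illustration.

```r
# Monte Carlo illustration of (b*quad)_j = gamma * beta_j under normal predictors.
set.seed(1)
n <- 1e5; p <- 5
H <- 0.5 + 0.5 * diag(p)                     # H_jj = 1, H_jk = 0.5
X <- matrix(rnorm(n * p), n, p) %*% chol(H)  # rows approximately N(0, H)
beta <- c(1, -1, 1, 0, 0)                    # beta_1, ..., beta_p; intercept beta_0 = 0.5
eta  <- plogis(0.5 + drop(X %*% beta))       # model (8) with logistic g
y <- ifelse(runif(n) < eta, 1, -1)
b_quad <- coef(lm(y ~ X))[-1]                # large-sample approximation of b*quad
b_quad[beta != 0] / beta[beta != 0]          # approximately equal: this common value is gamma
b_quad[beta == 0]                            # approximately zero
```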
Assumption 3 is a well-known condition in the literature, see e.g., [13,32,34,35,36]. It is always satisfied in the simple regression model (i.e., when $X_1 \in \mathbb{R}$), which is often used for initial screening of explanatory variables, see, e.g., [37]. It is also satisfied when $X$ comes from an elliptical distribution, such as the multivariate normal distribution or the multivariate t-distribution. The interesting paper [38] advocates that Assumption 3 is a nonrestrictive condition when the number of predictors is large, which is the case we focus on in this paper.
Now, we state the results of this part of the paper. We will use the notation $b_{min}^{quad} = \min_{j \in T} |(b^{*quad})_j|$.
Corollary 2.
Suppose that the conditions of Theorem 3 are satisfied for $q = \infty$. If $b_{min}^{quad} \geq \frac{4\xi\lambda}{(\xi+1)F_\infty(\xi)}$, then
$$P\big( \forall\, j \in T,\ k \notin T:\ |\hat{b}_j^{quad}| > |\hat{b}_k^{quad}| \big) \geq 1 - K_3 a,$$
where $K_3$ is the universal constant from Theorem 3.
In Corollary 2 we show that the Lasso estimator with the quadratic loss is able to separate predictors if the nonzero coefficients of $b^{*quad}$ are large enough in absolute value. In the case that $T$ equals (9) (i.e., $T$ is the set of significant predictors) we can prove that the thresholded Lasso estimator is able to find the true model with high probability. This fact is stated in the next result. The thresholded Lasso estimator is denoted by $\hat{b}_{th}^{quad}$ and defined as
$$(\hat{b}_{th}^{quad})_j = \hat{b}_j^{quad}\, I\big(|\hat{b}_j^{quad}| \geq \delta\big), \quad j = 1, \ldots, p,$$
where $\delta > 0$ is a threshold. We set $(\hat{b}_{th}^{quad})_0 = \hat{b}_0^{quad}$ and denote $\hat{T}_{th} = \{1 \leq j \leq p : (\hat{b}_{th}^{quad})_j \neq 0\}$.
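A thresholded Lasso estimator of this form is straightforward to compute; the R sketch below (ours) assumes a glmnet fit `fit_quad` at a single lambda, as in the earlier sketches, and an arbitrary threshold `delta`.

```r
# Sketch of the thresholded Lasso estimator (illustration; `delta` is arbitrary).
delta <- 0.1
b_hat <- drop(as.matrix(coef(fit_quad)))     # (p+1)-vector: intercept, then slopes
b_th  <- b_hat
b_th[-1] <- ifelse(abs(b_hat[-1]) >= delta, b_hat[-1], 0)  # intercept not thresholded
T_hat_th <- which(b_th[-1] != 0)             # estimated set of significant predictors
```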
Corollary 3.
Let $g$ in (8) be monotonic. We suppose that Assumption 3 and the conditions of Theorem 3 are satisfied for $q = \infty$. If
$$\frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)} < \delta \leq b_{min}^{quad}/2,$$
then
$$P\big( \hat{T}_{th} = T \big) \geq 1 - K_3 a,$$
where $K_3$ is the universal constant from Theorem 3.
Corollary 3 states that the Lasso estimator after thresholding is able to find the true model with high probability if the threshold is appropriately chosen. However, Corollary 3 does not give a constructive way of choosing the threshold, because both endpoints of the interval $\big[\frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)},\, b_{min}^{quad}/2\big]$ are unknown. This is not surprising and has already been observed, for instance, in linear models ([9], Theorem 8). In the literature we can find methods which help to choose a threshold in practice, for instance the approach relying on information criteria developed in [39,40].
Finally, we discuss the condition of Corollary 3 that $b_{min}^{quad}$ cannot be too small, i.e., $b_{min}^{quad} \geq \frac{4\xi\lambda}{(\xi+1)F_\infty(\xi)}$. We know that $(b^{*quad})_j = \gamma\beta_j$ for $j = 1, \ldots, p$, so the considered condition requires that
$$\min_{j \in T} |\beta_j| \geq \frac{4\xi\lambda}{|\gamma|(\xi+1)F_\infty(\xi)}. \tag{26}$$
Compared with the similar condition for Lasso estimators in well-specified models, we observe that the denominator in (26) contains an additional factor $|\gamma|$. This number is usually smaller than one, which means that in misspecified models the Lasso estimator needs a larger sample size to work well. This phenomenon is typical for misspecified models and similar restrictions hold for competing methods [13].

6. Numerical Experiments

In this section we present a simulation study, in which we compare the accuracy of the considered estimators in prediction and variable selection.
We consider the model (8) with predictors generated from the p-dimensional normal distribution $N(0, H)$, where $H_{jj} = 1$ and $H_{jk} = 0.5$ for $j \neq k$. The true parameter is
$$\beta = (1, \underbrace{\pm 1, \pm 1, \ldots, \pm 1}_{10}, 0, 0, \ldots, 0)', \tag{27}$$
where the signs are chosen at random. The first coordinate in (27) corresponds to the intercept and the next ten coefficients relate to the significant predictors in the model. We study two cases:
- Scenario 1: $g(x) = \exp(x)/(1 + \exp(x))$;
- Scenario 2: $g(x) = \arctan(x)/\pi + 0.5$.
In each scenario we generate the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ for $n \in \{100, 350, 600\}$. The corresponding numbers of predictors are $p \in \{100, 1225, 3600\}$, so the number of predictors significantly exceeds the sample size in the experiments. For every model we consider two Lasso estimators with unpenalized intercepts (5): the first one with the logistic loss and the second one with the quadratic loss. They are denoted by "logistic" and "quadratic", respectively. To calculate them we use the "glmnet" package [28] in the "R" software [41]. The tuning parameters $\lambda$ are chosen on the basis of 10-fold cross-validation.
Observe that applying the Lasso estimator with the logistic loss function to Scenario 1 leads to a well-specified model, while using the quadratic loss implies misspecification. In Scenario 2 both estimators work in misspecified models.
Simulations for each scenario are repeated 300 times.
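For concreteness, a condensed sketch of a single simulation run is given below (our reconstruction of the setup described above, not the authors' code); it covers Scenario 1 with n = 350 and p = 1225.

```r
# One simulation run: Scenario 1, n = 350, p = 1225 (illustration only).
library(MASS); library(glmnet)
set.seed(1)
n <- 350; p <- 1225; s <- 10
H <- matrix(0.5, p, p); diag(H) <- 1
beta0 <- 1
beta  <- c(sample(c(-1, 1), s, replace = TRUE), rep(0, p - s))  # signs at random
g <- function(x) exp(x) / (1 + exp(x))                          # Scenario 1
X <- mvrnorm(n, mu = rep(0, p), Sigma = H)
y <- ifelse(runif(n) < g(drop(beta0 + X %*% beta)), 1, -1)
# the two Lasso classifiers with lambda chosen by 10-fold cross-validation
cv_quad <- cv.glmnet(X, y, family = "gaussian", nfolds = 10)
cv_log  <- cv.glmnet(X, factor(y), family = "binomial", nfolds = 10)
```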
To describe the quality of estimators in variable selection we calculate two values:
- TD — the number of correctly selected relevant predictors;
- sep — the number of relevant predictors whose Lasso coefficients are larger in absolute value than the largest absolute Lasso coefficient among the irrelevant predictors.
In this way we want to confirm that the considered estimators are able to separate predictors, as established in Section 5. Using TD we also study the "screening" property of estimators, which is weaker than separability.
The classification accuracy of estimators is measured in the following way: we generate a test sample containing 1000 objects, on which we calculate
- pred — the fraction of correctly predicted classes of objects for each estimator.
A sketch of computing TD, sep and pred is given below.
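Continuing the sketch above, the three measures can be computed as follows (the variable names are ours).

```r
# Selection measures TD and sep, and prediction accuracy pred (illustration).
b_quad <- drop(as.matrix(coef(cv_quad, s = "lambda.min")))[-1]  # slopes only
relevant <- 1:s; irrelevant <- (s + 1):p
TD  <- sum(b_quad[relevant] != 0)
sep <- sum(abs(b_quad[relevant]) > max(abs(b_quad[irrelevant])))
# prediction accuracy on a fresh test sample of 1000 objects
X_test <- mvrnorm(1000, mu = rep(0, p), Sigma = H)
y_test <- ifelse(runif(1000) < g(drop(beta0 + X_test %*% beta)), 1, -1)
f_test <- drop(predict(cv_quad, newx = X_test, s = "lambda.min"))
pred <- mean(ifelse(f_test >= 0, 1, -1) == y_test)
c(TD = TD, sep = sep, pred = pred)
```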
The results of experiments are collected in Table 1 and Table 2. By the “oracle” we mean the classifier, which works only with significant predictors and uses the function g from the true model (8) in the estimation process.
Finally, we also compare the execution times of both algorithms. In Table 3 we show the averaged relative time difference
$$\frac{t_{log} - t_{quad}}{t_{quad}}, \tag{28}$$
where $t_{quad}$ and $t_{log}$ are the computation times of the Lasso with the quadratic and the logistic loss function, respectively.
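The quantity (28) can be estimated directly by timing the two fits, e.g., with system.time in R (continuing the earlier sketch).

```r
# Relative time difference (28) between the logistic- and quadratic-loss fits.
t_quad <- system.time(cv.glmnet(X, y, family = "gaussian", nfolds = 10))["elapsed"]
t_log  <- system.time(cv.glmnet(X, factor(y), family = "binomial", nfolds = 10))["elapsed"]
unname((t_log - t_quad) / t_quad)
```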
Looking at the results of the experiments we observe that both estimators perform in a satisfactory way. Their predictive accuracy is relatively close to the oracle, especially when the sample size is larger. In variable selection we see that both estimators are able to find significant predictors and to separate predictors in both scenarios. Again, we can notice that the properties of the estimators improve as n increases.
In Scenario 2 the quality of both estimators in prediction and variable selection is comparable. In Scenario 1, which is well-specified for the Lasso with the logistic loss, we observe its dominance over the Lasso with the quadratic loss; however, this dominance is not large. Therefore, using the Lasso with the quadratic loss we obtain slightly worse accuracy, but the algorithm is computationally faster. The computational efficiency is especially important when we study large data sets. As we can see in Table 3, the execution times of the estimators are almost the same for n = 350, but for n = 600 the relative time difference becomes greater than 10%.

Author Contributions

Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The research of K.F. was partially supported by Warsaw University of Life Sciences (SGGW).

Acknowledgments

We would like to thank J. Mielniczuk and the reviewers for their valuable comments, which have improved the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs and Auxiliary Results

This section contains proofs of results from the paper. Additional lemmas are also provided.

Appendix A.1. Results from Section 3

Proof of Theorem 1.
For arbitrary $b \in \mathbb{R}^{p+1}$ the averaged misclassification risk of $f_b$ can be expressed as
$$E_D R(b) = E_D E_{(X,Y)} \big[ I(Y=1) I(b'\tilde{X} < 0) + I(Y=-1) I(b'\tilde{X} \geq 0) \big]. \tag{A1}$$
Moreover, we have
$$I(Y=-1) I(b'\tilde{X} \geq 0) = I(Y=-1)\big[1 - I(b'\tilde{X} < 0)\big]. \tag{A2}$$
Applying (A1) and (A2) for $\hat{b}$ and $b^*$, we obtain
$$|E_D R(\hat{b}) - R(b^*)| = \Big| E_D E_{(X,Y)} \big[I(Y=1) - I(Y=-1)\big]\big[I(\hat{b}'\tilde{X} < 0) - I((b^*)'\tilde{X} < 0)\big] \Big| \leq E_D E_{(X,Y)} \big|I(\hat{b}'\tilde{X} < 0) - I((b^*)'\tilde{X} < 0)\big| = P(\hat{b}'\tilde{X} < 0, (b^*)'\tilde{X} \geq 0) + P(\hat{b}'\tilde{X} \geq 0, (b^*)'\tilde{X} < 0),$$
where $P$ is the probability with respect to both the data $D$ and the new object $X$. Observe that on the event $\Omega$ we have
$$\hat{b}'\tilde{X} \leq c|\tilde{X}|_\infty + (b^*)'\tilde{X},$$
so
$$P(\hat{b}'\tilde{X} \geq 0, (b^*)'\tilde{X} < 0) = P(\hat{b}'\tilde{X} \geq 0, (b^*)'\tilde{X} < 0, \Omega) + P(\hat{b}'\tilde{X} \geq 0, (b^*)'\tilde{X} < 0, \Omega^c) \leq P_X(-c|\tilde{X}|_\infty \leq (b^*)'\tilde{X} < 0) + P_D(\Omega^c).$$
Analogously, we obtain
$$P(\hat{b}'\tilde{X} < 0, (b^*)'\tilde{X} \geq 0) \leq P_X(0 \leq (b^*)'\tilde{X} \leq c|\tilde{X}|_\infty) + P_D(\Omega^c)$$
from
$$\hat{b}'\tilde{X} \geq (b^*)'\tilde{X} - c|\tilde{X}|_\infty,$$
which finishes the proof. □
Lemma A1.
Suppose that Assumption 1 is fulfilled. Moreover, suppose that the random variable $(b^*)'\tilde{X}$ has a density $h$, which is continuous on the interval $U = [-2\sigma c\sqrt{\log p},\, 2\sigma c\sqrt{\log p}]$, and let $\tilde{h} = \sup_{u \in U} h(u)$. Then
$$P_X\big(|(b^*)'\tilde{X}| \leq c|\tilde{X}|_\infty\big) \leq 4\sigma\tilde{h} c\sqrt{\log p} + 2/p. \tag{A3}$$
Proof. 
For simplicity, we omit the subscript $X$ in the probability $P_X$ in this proof. We take $a > 1$ and obtain the inequalities
$$P\big(|(b^*)'\tilde{X}| \leq c|\tilde{X}|_\infty\big) \leq P\big(|(b^*)'\tilde{X}| \leq c|\tilde{X}|_\infty,\, |\tilde{X}|_\infty \leq a\big) + P\big(|(b^*)'\tilde{X}| \leq c|\tilde{X}|_\infty,\, |\tilde{X}|_\infty > a\big) \leq P\big(|(b^*)'\tilde{X}| \leq ca\big) + P\big(|\tilde{X}|_\infty > a\big). \tag{A4}$$
The second expression in (A4) equals $P(|X|_\infty > a)$, because $a > 1$. It can be handled using subgaussianity of $X$ as follows: take $z > 0$ and notice that by the Markov inequality and the fact that $\exp(|u|) \leq \exp(u) + \exp(-u)$ for each $u \in \mathbb{R}$, we obtain
$$P(|X|_\infty > a) \leq e^{-za}\, E\exp(z|X|_\infty) \leq e^{-za} \sum_{j=1}^p E\exp(z|X_j|) \leq 2p\exp(\sigma^2 z^2/2 - az).$$
Taking $z = a/\sigma^2$, we obtain
$$P(|X|_\infty > a) \leq 2p\exp\big(-a^2/(2\sigma^2)\big).$$
Then we choose $a = 2\sigma\sqrt{\log p}$, which is not smaller than one because $\sigma \geq 1$ from Assumption 1, and the bound above becomes $2/p$.
Finally, the first term in (A4) can be bounded from above by $2ca\tilde{h} = 4\sigma\tilde{h} c\sqrt{\log p}$ by the mean value theorem. □
Proof of Theorem 2.
The right-hand sides of (15) and (17) are upper bounds on the estimation risk. They are obtained using Theorem 1 and Lemma A1. The expressions (16) and (18) are upper bounds for the approximation risk in the case of estimators with the quadratic and logistic loss functions, respectively. In particular, (16) follows from ([17], Theorem 2.1) applied to $f_{b^{*quad}}$ and ([17], Example 3.1). Establishing (18) is similar: we just use ([17], Theorem 2.1) applied to $f_{b^{*log}}$ and ([17], Example 3.5) to show that
$$R(b^{*log}) - R_B \leq \sqrt{2\, E\, KL\big(\eta(X), \eta^{log}((b^{*log})'\tilde{X})\big)}, \tag{A5}$$
where the Kullback–Leibler distance $KL(\cdot,\cdot)$ is defined in (10).
Next, we define the function $h(a) = a\log a + (1-a)\log(1-a)$ for $a \in (0,1)$. Clearly, we have $KL(a,b) = h(a) - h(b) - h'(b)(a-b)$ and $h''(a) = (a(1-a))^{-1}$. Therefore, from the mean value theorem
$$KL(a,b) = \frac{(a-b)^2}{2c(1-c)} \tag{A6}$$
for some $c$ between $a$ and $b$. To finish the proof we apply (A6) to the right-hand side of (A5) with $\delta < c < 1 - \delta$. □

Appendix A.2. Results from Section 4

To simplify the notation, in this section we write $\hat{b}$, $b^*$ for $\hat{b}^{quad}$, $b^{*quad}$, respectively. Moreover, we also denote $\mathring{b}^* = ((b^*)_1, \ldots, (b^*)_p)'$.
We start with establishing results, which help us to prove Theorem 3.
Lemma A2.
For $\mathring{b}^* = H^{-1}E[XY]$ we have $(\mathring{b}^*)' H \mathring{b}^* \leq 1$.
Proof. 
The proof is elementary and based on the inequality
$$0 \leq E\big[ E(Y|X) - (\mathring{b}^*)'X \big]^2. \tag{A7}$$
The right-hand side of (A7) can be expressed as
$$E\big[E(Y|X)\big]^2 - 2(\mathring{b}^*)'\, E\big[X\, E(Y|X)\big] + (\mathring{b}^*)'\, E[XX']\, \mathring{b}^* = E\big[E(Y|X)\big]^2 - 2(\mathring{b}^*)'\, E[XY] + (\mathring{b}^*)' H \mathring{b}^*. \tag{A8}$$
Using $\mathring{b}^* = H^{-1}E[XY]$, we have $(\mathring{b}^*)'E[XY] = (\mathring{b}^*)'H H^{-1} E[XY] = (\mathring{b}^*)'H\mathring{b}^*$, and since $E[E(Y|X)]^2 \leq E Y^2$ by Jensen's inequality, we can bound the right-hand side of (A8) from above by
$$E Y^2 - (\mathring{b}^*)' H \mathring{b}^* = 1 - (\mathring{b}^*)' H \mathring{b}^*,$$
which finishes the proof. □
The next result is given in ([42], Corollary 8.2).
Lemma A3.
Suppose that $Z_1, \ldots, Z_n$ are i.i.d. random variables and there exists $L > 0$ such that $C^2 = E\exp(|Z_1|/L)$ is finite. Then for arbitrary $u > 0$
$$P\left( \frac{1}{n}\sum_{i=1}^n (Z_i - EZ_i) > 2LC\Big(\sqrt{\frac{2u}{n}} + \frac{u}{n}\Big) \right) \leq \exp(-u).$$
Lemma A4.
For arbitrary $j = 1, \ldots, p$ and $u > 0$ we have
$$P\left( \frac{2}{n}\sum_{i=1}^n X_{ij}\big(X_i'\mathring{b}^* + EY - Y_i\big) > 16.4\,\sigma^2\Big(3\sqrt{\frac{2u}{n}} + \frac{u}{n}\Big) \right) \leq \exp(-u).$$
Proof. 
Fix $j \in \{1, \ldots, p\}$ and $u > 0$. Recall that $H\mathring{b}^* = E[YX]$ and $EX = 0$. Thus, we work with an average of i.i.d. centred random variables, so we can use Lemma A3. We only have to find $L, C > 0$ such that
$$E \exp\big( |X_j(X'\mathring{b}^* + EY - Y)|/L \big) \leq C^2, \tag{A10}$$
where $X_j$ is the j-th coordinate of $X$. For arbitrary positive numbers $a, b$ we have the inequality $ab \leq a^2/2 + b^2/2$. Therefore, we have
$$|X_j(X'\mathring{b}^* + EY - Y)| \leq X_j^2/2 + (X'\mathring{b}^*)^2 + 4.$$
Applying this fact and the Schwarz inequality we obtain
$$E \exp\big( |X_j(X'\mathring{b}^* + EY - Y)|/L \big) \leq \exp\Big(\frac{4}{L}\Big)\, \Big[E\exp\Big(\frac{X_j^2}{L}\Big)\Big]^{1/2} \Big[E\exp\Big(\frac{2(X'\mathring{b}^*)^2}{L}\Big)\Big]^{1/2}. \tag{A11}$$
The variable $X_j$ is subgaussian, so using ([43], Lemma 7.4) we can bound the first expectation in (A11) by $(1 - 2\sigma^2/L)^{-1/2}$, provided that $L > 2\sigma^2$. The second expectation in (A11) can be bounded, using subgaussianity of the vector $X_T$, ([43], Lemma 7.4) and Lemma A2, in the following way:
$$E\exp\Big(\frac{2(X'\mathring{b}^*)^2}{L}\Big) \leq (1 - 4\sigma^2/L)^{-1/2},$$
provided that $4\sigma^2 < L$. Taking $L = 4.1\sigma^2$ we can bound $\exp(4/L) \leq 2.7$, because $H_{jj} = 1$ implies that $\sigma \geq 1$. Thus, we obtain $C \leq 3$ for the upper bound $C^2$ in (A10), which finishes the proof. □
Lemma A5.
Suppose that the assumptions of Theorem 3 are satisfied. Then for arbitrary $a \in (0,1)$, $q \geq 1$, $\xi > 1$, with probability at least $1 - Ka$ we have $\bar{F}_q(\xi) \geq F_q(\xi)/2$, where $K$ is a universal constant.
Proof. 
Fix $a \in (0,1)$, $q \geq 1$, $\xi > 1$. We start with considering the $l_\infty$-norm of the matrix
$$\big|\tilde{\mathbf{X}}'\tilde{\mathbf{X}}/n - \tilde{H}\big|_\infty = \max\Big( \max_{j,k = 1,\ldots,p} \Big|\frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik} - E X_j X_k\Big|, \tag{A12}$$
$$\qquad \max_{j = 1,\ldots,p} \Big|\frac{1}{n}\sum_{i=1}^n X_{ij}\Big| \Big). \tag{A13}$$
We focus only on (A12), because (A13) can be handled similarly. Thus, fix $j, k \in \{1, \ldots, p\}$. Using subgaussianity of the predictors, Lemma A3 and argumentation similar to the proof of Lemma A4 we have
$$P\left( \Big|\frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik} - E X_{1j}X_{1k}\Big| > K_2\sigma^2\sqrt{\frac{\log(p^2/a)}{n}} \right) \leq \frac{2a}{p^2},$$
where $K_2$ is a universal constant. The values of the constants $K_i$ that appear in this proof can change from line to line.
Therefore, using the union bound we obtain
$$P\left( \big|\tilde{\mathbf{X}}'\tilde{\mathbf{X}}/n - \tilde{H}\big|_\infty > K_2\sigma^2\sqrt{\frac{\log(p^2/a)}{n}} \right) \leq K_3 a.$$
Proceeding similarly to the proof of ([30], Lemma 4.1) we have the following probabilistic inequality
$$\bar{F}_q(\xi) \geq F_q(\xi) - K_2(1+\xi)|T|\,\sigma^2\sqrt{\frac{\log(p^2/a)}{n}}.$$
To finish the proof we use (20) with $K_1$ being sufficiently large. □
Proof of Theorem 3.
Let $a \in (0,1)$, $q \geq 1$, $\xi > 1$ be arbitrary. The main part of the proof is to show that with high probability
$$|\hat{b} - b^*|_q \leq \frac{\xi |T|^{1/q} \lambda}{(\xi+1) \bar{F}_q(\xi)}. \tag{A14}$$
Then we apply Lemma A5 to obtain (22).
Thus, we focus on showing that (A14) holds with high probability. Denote $A = \big\{|\nabla \bar{Q}(b^*)|_\infty \leq \frac{\xi-1}{\xi+1}\lambda\big\}$. We start with bounding the probability of $A$ from below. Recall that $b^*$ is the minimizer of $Q(b) = E(1 - Y b'\tilde{X})^2$, which can be easily calculated, namely
$$\mathring{b}^* = H^{-1} E[YX] \quad \text{and} \quad (b^*)_0 = EY. \tag{A15}$$
For every $j = 1, \ldots, p$ the j-th partial derivative of $\bar{Q}(b)$ at $b^*$ is
$$\nabla_j \bar{Q}(b^*) = \frac{2}{n} \sum_{i=1}^n X_{ij}\big(X_i'\mathring{b}^* + EY - Y_i\big).$$
The derivative with respect to $b_0$ is
$$\nabla_0 \bar{Q}(b^*) = \frac{2}{n} \sum_{i=1}^n \big(X_i'\mathring{b}^* + EY - Y_i\big).$$
Taking $\lambda$ which satisfies (21) and using the union bound, we obtain
$$P(A^c) \leq \sum_{j=0}^p P\left( |\nabla_j \bar{Q}(b^*)| > K_2 \sigma^2 \sqrt{\frac{\log(p/a)}{n}} \right). \tag{A17}$$
Consider a summand on the right-hand side of (A17) that corresponds to $j \in \{1, \ldots, p\}$. By (A15) we can handle it using Lemma A4: we just take $u = \log(p/a)$ and a sufficiently large $K_2$. The probability of the first term on the right-hand side of (A17), which corresponds to $j = 0$, can be bounded from above analogously to the proof of Lemma A4; the argument is even easier, so we omit it.
In the further argumentation we consider only the event $A$. Besides, we denote $\theta = \hat{b} - b^*$, where $\hat{b}$ is a minimizer of the convex function (5), which is equivalent to
$$\nabla_j \bar{Q}(\hat{b}) = -\lambda\, \mathrm{sign}(\hat{b}_j) \ \text{ for } \hat{b}_j \neq 0; \qquad |\nabla_j \bar{Q}(\hat{b})| \leq \lambda \ \text{ for } \hat{b}_j = 0; \qquad \nabla_0 \bar{Q}(\hat{b}) = 0, \tag{A18}$$
where $j = 1, \ldots, p$.
First, we prove that $\theta \in C(\xi)$. Here our argumentation is standard [9]. From (A18), the fact that $|\theta|_1 = |\theta_T|_1 + |\theta_{T^c}|_1 + |\theta_0|$ and noting that $\theta_j = \hat{b}_j$ for $j \in T^c$, we can calculate
$$0 \leq 2\theta'\tilde{\mathbf{X}}'\tilde{\mathbf{X}}\theta/n = \theta'\big[\nabla\bar{Q}(\hat{b}) - \nabla\bar{Q}(b^*)\big] = \sum_{j \in T} \theta_j \nabla_j \bar{Q}(\hat{b}) + \sum_{j \in T^c} \hat{b}_j \nabla_j \bar{Q}(\hat{b}) - \theta'\nabla\bar{Q}(b^*) \leq \lambda \sum_{j \in T} |\theta_j| - \lambda \sum_{j \in T^c} |\hat{b}_j| + |\theta|_1 |\nabla\bar{Q}(b^*)|_\infty = \big(\lambda + |\nabla\bar{Q}(b^*)|_\infty\big)|\theta_T|_1 + \big(|\nabla\bar{Q}(b^*)|_\infty - \lambda\big)|\theta_{T^c}|_1 + |\theta_0|\,|\nabla\bar{Q}(b^*)|_\infty.$$
Thus, using the fact that we consider the event $A$, we get
$$|\theta_{T^c}|_1 \leq \frac{\lambda + |\nabla\bar{Q}(b^*)|_\infty}{\lambda - |\nabla\bar{Q}(b^*)|_\infty}\,|\theta_T|_1 + \frac{|\nabla\bar{Q}(b^*)|_\infty}{\lambda - |\nabla\bar{Q}(b^*)|_\infty}\,|\theta_0| \leq \xi |\theta_{\tilde{T}}|_1.$$
Therefore, from the definition of $\bar{F}_q(\xi)$ we have
$$|\hat{b} - b^*|_q \leq \frac{|T|^{1/q}\, |\tilde{\mathbf{X}}'\tilde{\mathbf{X}}(\hat{b} - b^*)/n|_\infty}{\bar{F}_q(\xi)} \leq \frac{|T|^{1/q}\, \big(|\nabla\bar{Q}(\hat{b})|_\infty/2 + |\nabla\bar{Q}(b^*)|_\infty/2\big)}{\bar{F}_q(\xi)}.$$
Using (A18) and the fact that we are on $A$, we obtain (A14). □

Appendix A.3. Results from Section 5

Proof of Corollary 2.
The proof is a simple consequence of the bound (22) with $q = \infty$ obtained in Theorem 3. Indeed, for arbitrary predictors $j \in T$ and $k \notin T$ we obtain
$$|\hat{b}_j^{quad}| \geq |(b^{*quad})_j| - |\hat{b}_j^{quad} - (b^{*quad})_j| \geq b_{min}^{quad} - |\hat{b}^{quad} - b^{*quad}|_\infty > \frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)} \geq |\hat{b}^{quad} - b^{*quad}|_\infty \geq |\hat{b}_k^{quad} - (b^{*quad})_k| = |\hat{b}_k^{quad}|. \ \Box$$
Proof of Corollary 3.
The proof is almost the same as the proof of Corollary 2, so it is omitted. □

References

1. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer: New York, NY, USA, 2001.
2. Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: New York, NY, USA, 2011.
3. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
4. Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 2006, 34, 1436–1462.
5. Zhao, P.; Yu, B. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563.
6. Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
7. van de Geer, S. High-dimensional generalized linear models and the Lasso. Ann. Stat. 2008, 36, 614–645.
8. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
9. Ye, F.; Zhang, C.H. Rate minimaxity of the Lasso and Dantzig selector for the lq loss in lr balls. J. Mach. Learn. Res. 2010, 11, 3519–3540.
10. Huang, J.; Zhang, C.H. Estimation and Selection via Absolute Penalized Convex Minimization and Its Multistage Adaptive Applications. J. Mach. Learn. Res. 2012, 13, 1839–1864.
11. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
12. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
13. Kubkowski, M.; Mielniczuk, J. Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors. Entropy 2020, 22, 153.
14. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
15. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
16. Quintero, F.; Contreras-Reyes, J.E.; Wiff, R.; Arellano-Valle, R.B. Flexible Bayesian analysis of the von Bertalanffy growth function with the use of a log-skew-t distribution. Fish. Bull. 2017, 115, 12–26.
17. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85.
18. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156.
19. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer-Verlag: New York, NY, USA, 1996.
20. Boucheron, S.; Bousquet, O.; Lugosi, G. Introduction to statistical learning theory. Adv. Lect. Mach. Learn. 2004, 36, 169–207.
21. Boucheron, S.; Bousquet, O.; Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM P&S 2005, 9, 323–375.
22. Bartlett, P.L.; Bousquet, O.; Mendelson, S. Local Rademacher complexities. Ann. Stat. 2005, 33, 1497–1537.
23. Audibert, J.Y.; Tsybakov, A.B. Fast learning rates for plug-in classifiers. Ann. Stat. 2007, 35, 608–633.
24. Blanchard, G.; Bousquet, O.; Massart, P. Statistical performance of support vector machines. Ann. Stat. 2008, 36, 489–531.
25. Tarigan, B.; van de Geer, S. Classifiers of support vector machine type with l1 complexity regularization. Bernoulli 2006, 12, 1045–1076.
26. Abramovich, F.; Grinshtein, V. High-Dimensional Classification by Sparse Logistic Regression. IEEE Trans. Inf. Theory 2019, 65, 3068–3079.
27. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
28. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22.
29. Buldygin, V.; Kozachenko, Y. Metric Characterization of Random Variables and Random Processes; American Mathematical Society: Providence, RI, USA, 2000.
30. Huang, J.; Sun, T.; Ying, Z.; Yu, Y.; Zhang, C.H. Oracle inequalities for the lasso in the Cox model. Ann. Stat. 2013, 41, 1142–1165.
31. van de Geer, S.; Bühlmann, P. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 2009, 3, 1360–1392.
32. Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052.
33. Thorisson, H. Coupling methods in probability theory. Scand. J. Stat. 1995, 22, 159–182.
34. Brillinger, D.R. A Generalized Linear Model with Gaussian Regressor Variables. In A Festschrift for Erich Lehmann; Bickel, P.J., Doksum, K., Hodges, J.L., Eds.; Wadsworth: Belmont, CA, USA, 1983; pp. 97–114.
35. Ruud, P.A. Sufficient Conditions for the Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution in Multinomial Discrete Choice Models. Econometrica 1983, 51, 225–228.
36. Zhong, W.; Zhu, L.; Li, R.; Cui, H. Regularized quantile regression and robust feature screening for single index models. Stat. Sin. 2016, 26, 69–95.
37. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 2008, 70, 849–911.
38. Hall, P.; Li, K.C. On almost Linearity of Low Dimensional Projections from High Dimensional Data. Ann. Stat. 1993, 21, 867–889.
39. Pokarowski, P.; Mielniczuk, J. Combined l1 and Greedy l0 Penalized Least Squares for Linear Model Selection. J. Mach. Learn. Res. 2015, 16, 961–992.
40. Pokarowski, P.; Rejchel, W.; Soltys, A.; Frej, M.; Mielniczuk, J. Improving Lasso for model selection and prediction. arXiv 2019, arXiv:1907.03025.
41. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017.
42. van de Geer, S. Estimation and Testing under Sparsity; Springer: Berlin, Germany, 2016.
43. Baraniuk, R.; Davenport, M.A.; Duarte, M.F.; Hegde, C. An Introduction to Compressive Sensing; Connexions, Rice University: Houston, TX, USA, 2011.
Table 1. Results for Scenario 1.

          Quadratic   Logistic   Oracle
n = 100
  TD        6.3         6.1        -
  sep       2.2         2.3        -
  pred      0.734       0.736      0.810
n = 350
  TD        9.3         9.5        -
  sep       6.0         6.3        -
  pred      0.774       0.779      0.831
n = 600
  TD        9.8         9.9        -
  sep       8.6         8.9        -
  pred      0.791       0.795      0.832

Table 2. Results for Scenario 2.

          Quadratic   Logistic   Oracle
n = 100
  TD        4.8         4.6        -
  sep       1.4         1.4        -
  pred      0.697       0.698      0.768
n = 350
  TD        8.1         8.2        -
  sep       3.9         3.9        -
  pred      0.730       0.731      0.805
n = 600
  TD        9.4         9.4        -
  sep       6.8         6.9        -
  pred      0.750       0.752      0.809

Table 3. Relative time difference (28) of the algorithms.

            Scenario 1   Scenario 2
n = 350        0.02         0.06
n = 600        0.11         0.13
