Default priors for the intercept parameter in logistic regressions

https://doi.org/10.1016/j.csda.2018.10.014

Abstract

In logistic regression, separation occurs when a linear combination of predictors perfectly discriminates the binary outcome. Because finite-valued maximum likelihood parameter estimates do not exist under separation, Bayesian regressions with informative shrinkage of the regression coefficients offer a suitable alternative. Classical studies of separation imply that efficiency in estimating regression coefficients may also depend upon the choice of intercept prior, yet relatively little attention has been given to whether and how to shrink the intercept parameter. Alternative prior distributions for the intercept are proposed that downweight implausibly extreme regions of the parameter space, rendering regression estimates that are less sensitive to separation. Through simulation and the analysis of exemplar datasets, differences across priors, stratified by established statistics measuring the degree of separation, are quantified. Relative to diffuse priors, these proposed priors generally yield more efficient estimation of the regression coefficients themselves when the data are nearly separated. They are equally efficient in non-separated datasets, making them suitable for default use. Modest differences were observed with respect to out-of-sample discrimination. These numerical studies also highlight the interplay between priors for the intercept and the regression coefficients: findings are more sensitive to the choice of intercept prior when using a weakly informative prior on the regression coefficients than an informative shrinkage prior.

Introduction

A default prior in principle falls between nearly flat/improper priors, e.g. Jeffreys' prior (Jeffreys, 1946), and informative shrinkage/variable selection priors, e.g. the Bayesian Lasso (Park and Casella, 2008). Such a prior mildly shrinks non- or weakly identified parameters toward some null value and leaves alone those well supported by the likelihood (Gelman et al., 2008, Greenland and Mansournia, 2015, Rainey, 2016). However, it does not borrow strength via shared hyperpriors as informative shrinkage priors do.

Default priors have been developed for binary data regression models, e.g. logistic regression, because of the possibility of so-called ‘separation’, or the existence of a linear combination of predictors that can perfectly discriminate the outcomes in the data (Albert and Anderson, 1984, Santner and Duffy, 1986). Separation can be ‘complete’ or ‘quasi-complete’, with both leading to non-finite maximum likelihood estimates (MLEs, Albert and Anderson, 1984, Santner and Duffy, 1986). A truly large association between a predictor and the outcome can cause separation, illustrating that this phenomenon is not always undesirable. Other causes include sparsity, high correlation between the predictors, or the inclusion of many binary predictors (Heinze and Schemper, 2002, Greenland and Mansournia, 2015). Regardless of cause, the lack of identification may warrant mild regularization via priors. For the intercept parameter, a very diffuse or improper prior is the usual choice; in this paper, we contend that its unique role and meaning warrant a more informative choice.
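To see the symptom concretely, the following minimal R sketch (ours, not the paper's) builds a toy dataset in which the sign of a single covariate perfectly discriminates the outcome; glm() then stops at arbitrarily large coefficient values with enormous standard errors.

    ## Complete separation: sign(x) perfectly discriminates y, so the
    ## likelihood increases without bound and no finite MLE exists.
    x <- c(-2, -1.5, -1, 1, 1.5, 2)
    y <- as.integer(x > 0)
    fit <- glm(y ~ x, family = binomial())
    ## glm() warns that fitted probabilities numerically 0 or 1 occurred;
    ## the reported estimates are simply where the iterations stopped.
    coef(fit)                                  # |beta| very large
    summary(fit)$coefficients[, "Std. Error"]  # enormous standard errors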

A number of authors have proposed default prior specifications for regression coefficients (Clogg et al., 1991, Bedrick et al., 1996, Zorn, 2005, Gelman et al., 2008, Hanson et al., 2014, Greenland and Mansournia, 2015). The scale-family of g-priors, or reference informative priors, is one early example of a default prior for regression coefficients (Zellner, 1983). The degree of shrinkage – and therefore the extent to which that family of priors may be viewed as ‘default’ – depends on the choice of scale parameter g, which is shared by all regression coefficients. If fixed at some diffuse or prespecified value, this would satisfy our definition of a default prior (Hanson et al., 2014). If instead g is adaptively tuned via a hyperprior, as in Marin and Robert (2007), this would not, as information is shared across parameters.

Christmann and Rousseeuw (2001) propose quantitative measures of separation: n_comp ≥ 0 is the size of the smallest subset of observations that, if removed from the data, would completely separate the complementary subset. Separation is equivalent to n_comp = 0; the resulting lack of numerical convergence allows for relatively easy detection. In contrast, near-separation, i.e. a small but positive n_comp, yields finite MLEs but manifests symptoms of separation including instability and efficiency loss. For this reason, mild regularization from default priors can be just as useful when there are a few dozen predictors as when there are hundreds or more, particularly when the number of observations is of a similar order, i.e. p ≈ n.
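For very small datasets, n_comp can be computed by brute force. The sketch below is an illustrative assumption of ours rather than the authors' algorithm: it tests a candidate subset for complete separation by checking feasibility of the linear program s_i(α + X_i′β) ≥ 1 for all retained i, where s_i = 2Y_i − 1, using the lpSolve package, and searches removal subsets in increasing size.

    ## Brute-force n_comp for tiny n (combinatorial; illustration only).
    library(lpSolve)

    is_completely_separated <- function(y, X) {
      s <- 2 * y - 1                      # +1/-1 outcome signs
      A <- s * cbind(1, X)                # constraint rows: s_i * (1, x_i')
      A <- cbind(A, -A)                   # split free variables into +/- parts
      res <- lp(direction = "min",
                objective.in = rep(0, ncol(A)),  # pure feasibility problem
                const.mat = A,
                const.dir = rep(">=", nrow(A)),
                const.rhs = rep(1, nrow(A)))
      res$status == 0                     # feasible <=> complete separation
    }

    n_comp_brute <- function(y, X) {
      n <- length(y)
      if (is_completely_separated(y, X)) return(0)  # n_comp = 0: separated data
      for (k in 1:(n - 1)) {
        for (rm_idx in combn(n, k, simplify = FALSE)) {
          keep <- setdiff(seq_len(n), rm_idx)
          if (is_completely_separated(y[keep], X[keep, , drop = FALSE]))
            return(k)                     # removing rm_idx separates the rest
        }
      }
      n
    }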

The point of departure in this work is a focus on the intercept parameter. In contrast to the regression coefficients, there is generally no intuitive null value toward which the intercept should be shrunk, and borrowing strength for the intercept is not possible (e.g. Section 3.4, Hastie et al., 2009). The usual recommendation is that its prior be flat or effectively so (Greenland and Mansournia, 2015, Zorn, 2005, Gelman et al., 2008). We demonstrate that straightforward efficiency gains are possible by assuming that exceptionally large values of the intercept are either implausible or unverifiable in the data; down-weighting these regions frequently improves efficiency.

This paper makes several contributions. First, we use complete separation to establish a rationale that mild shrinkage of the intercept can improve estimation of the regression coefficients. Following Ghosh et al. (2017), we consider a stronger type of separation that we call ‘pivotal separation’. Second, we propose to adapt the exponential-power scale-family of distributions (Box, 1953, West, 1987, Box and Tiao, 1992) for default use as a prior on the intercept in binary data regression models and develop an algorithm to determine a suitable scale for this prior. Finally, our work highlights the correspondence between choice of prior on the intercept and that of the regression coefficients.
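For reference, the exponential-power family has a simple closed form. The sketch below uses a common location-zero parametrization with scale s and shape q (our notation; the paper's parametrization may differ): q = 2 recovers the normal, while large q flattens the density near zero and sharply down-weights values beyond a few multiples of s, which is the behavior sought for the intercept.

    ## Exponential-power density: f(x) = q / (2 s Gamma(1/q)) * exp(-|x/s|^q).
    dexppow <- function(x, s = 1, q = 2) {
      q / (2 * s * gamma(1 / q)) * exp(-abs(x / s)^q)
    }
    integrate(dexppow, -Inf, Inf, s = 2, q = 8)  # ~1, normalization sanity check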

In Section 2, we motivate the interplay between the intercept and the regression coefficients under the premise of separation. Sections 3 and 4 describe our choices of priors for the intercept and the regression coefficients, respectively. Findings from a comprehensive simulation study are documented in Section 5, which provides a comparative appraisal of the Bayesian estimators under varied scenarios including sparsity and p ≈ n. The demonstration of our methodology on ten datasets in Section 6 illustrates the heterogeneity in the degree of separation in real data and, more importantly, highlights the stabilizing properties of our proposed priors in estimation of the regression coefficients. Section 7 interprets our results and discusses some counterarguments against using priors on the intercept parameter.

Section snippets

Motivation from separation

We have n data points denoted by {Y_i, X_i}_{i=1}^n, where Y_i ∈ {0, 1} and X_i is a p-dimensional vector of covariates. A generalized linear model (GLM) takes the form g(Pr(Y_i = 1 | X_i)) = α + X_i′β, where g is a link function, e.g. logistic, probit, or complementary log-log, mapping the unit interval to the real line. The likelihood is

L(α, β) = ∏_i [g^{-1}(α + X_i′β)]^{Y_i} [1 − g^{-1}(α + X_i′β)]^{1 − Y_i}.

Partition the outcomes into sets A = {i : Y_i = 1} and A^C = {i : Y_i = 0}. Complete separation holds when there exists D ⊂ R^{p+1} such that, for any {α, β} ∈ D, α + …
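The consequence of complete separation is easy to verify numerically: moving (α, β) along a separating direction increases the likelihood monotonically, so no finite maximizer exists. A small R illustration (ours):

    ## Log-likelihood of the logistic GLM evaluated along a separating ray.
    loglik <- function(alpha, beta, y, X) {
      eta <- alpha + X %*% beta
      sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
    }
    X <- matrix(c(-2, -1, 1, 2), ncol = 1)
    y <- c(0, 0, 1, 1)                        # completely separated by sign(x)
    sapply(c(1, 5, 10, 50), function(t) loglik(0, t, y, X))
    ## increases toward 0 (likelihood -> 1) as the step size t grows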

Possible choices of default prior

For the remainder of the paper, we focus on the logistic link function, g(x) = logit(x) ≡ log_e(x / [1 − x]). We center the covariate vector at its empirical mean in the data, so that X = 0 represents the mean, and implicitly assume that all prior formulations have a location parameter equal to zero. Probabilistically, the elements of β (and therefore prior interpretations) can only be framed in relative terms. For example, a log odds-ratio of log_e(1.3) increases a baseline probability of 0.004 to about …
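The arithmetic behind a statement of this kind is a one-liner on the probability scale: applying a log odds-ratio of log_e(1.3) to a baseline probability of 0.004 gives roughly 0.0052.

    ## Odds-ratio update on the probability scale.
    plogis(qlogis(0.004) + log(1.3))   # ~0.0052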

Prior on β

Our objective is not to directly compare priors on β but rather to isolate the impact of the choice of prior on α, given a typical choice of prior on β. Unavoidable, however, is that α and β will be a posteriori correlated, and thus the choice of prior on β matters. As discussed in Section 1, default priors on β are meant to provide automatic weak shrinkage of unstable components of β, whereas informative shrinkage priors proactively distinguish between signal and noise in the components of β …
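As one concrete illustration of this interplay (hedged: rstanarm is a stand-in, not the authors' implementation), the rstanarm interface exposes separate arguments for the two priors. The data frame dat with binary response y is hypothetical; note that rstanarm, consistent with the centering in Section 3, places the intercept prior on the intercept after internally centering the predictors.

    ## Same weakly informative prior on beta, two choices for the intercept.
    library(rstanarm)
    fit_flat <- stan_glm(y ~ ., data = dat, family = binomial(),
                         prior = student_t(df = 7, 0, 2.5),  # prior on beta
                         prior_intercept = NULL)             # flat intercept prior
    fit_reg  <- stan_glm(y ~ ., data = dat, family = binomial(),
                         prior = student_t(df = 7, 0, 2.5),
                         prior_intercept = normal(0, 2.5))   # mild intercept shrinkage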

Simulation study

We conduct a simulation study to compare the performance of six priors on α in terms of their impact on estimation (of the regression coefficients β) and discrimination of observations. We designed the scenarios to represent the challenging regressions that may lead to separation, namely rare events (small values of α), large associations (β large in magnitude), correlated, binary, and skewed X, and p ≈ n. All numerical analyses were conducted in the R statistical environment (R Core Team, 2016, …
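A single replicate in the spirit of that design can be sketched as follows; the specific constants (n, p, correlation structure, effect sizes) are our illustrative assumptions, not the paper's settings.

    ## One simulated dataset: rare events, one large association, correlated X.
    set.seed(1)
    n <- 100; p <- 20
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))        # AR(1)-type correlation
    X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
    alpha <- -3                                   # Pr(Y = 1 | X = 0) ~ 0.047
    beta  <- c(2.5, rep(0, p - 1))                # sparse, one large coefficient
    y <- rbinom(n, 1, plogis(drop(alpha + X %*% beta)))
    mean(y)                                       # low event rate; near-separation likely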

Data examples

We demonstrate our proposed priors on ten exemplar datasets with varying p/n ratios and degrees of separation and sparsity. Although these datasets have been previously reported, our re-analysis is novel in two ways. First, it quantifies heterogeneity in the degree of separation between datasets using the statistics n_comp, etc. Although Christmann and Rousseeuw (2001) have already studied this in five of these ten datasets, Algorithm 2 was able to identify a tighter upper-bound on n_comp or n_over …
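Algorithm 2 is the paper's own; as a generic stand-in, note that any removal set whose deletion separates the remainder certifies an upper bound on n_comp, so even a naive greedy heuristic (ours, reusing is_completely_separated() from the sketch in Section 1) produces valid, if possibly loose, bounds.

    ## Greedy upper bound on n_comp: repeatedly drop the worst-fit observation.
    greedy_ncomp_upper <- function(y, X) {
      removed <- 0
      repeat {
        if (is_completely_separated(y, X)) return(removed)
        fit   <- suppressWarnings(glm(y ~ X, family = binomial()))
        worst <- which.max(abs(residuals(fit, type = "deviance")))
        y <- y[-worst]; X <- X[-worst, , drop = FALSE]
        removed <- removed + 1
      }
    }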

Conclusion

Understanding the role the intercept parameter plays in a GLM is conceptually difficult. When the covariates are not centered and the support of X lies far from the origin, the parameter lacks meaningful interpretation. Yet prediction based on a model without an intercept will, in general, incur significant error. The intercept plays the crucial role of balancing the prediction plane. When the covariates are centered, the intercept is a function of the regression coefficients, as in the third …

Acknowledgment

This work was supported by the National Institutes of Health, USA [Grant Number P30 CA046592].

References

  • Christmann, A., et al. Measuring overlap in binary regression. Comput. Statist. Data Anal. (2001)
  • Lee, E. A computer program for linear logistic regression analysis. Comput. Programs Biomed. (1974)
  • Albert, A., et al. On the existence of maximum likelihood estimates in logistic regression models. Biometrika (1984)
  • Armagan, A., et al. Generalized double Pareto shrinkage. Statist. Sinica (2013)
  • Barbaro, R., et al. Evaluating mortality risk adjustment among children receiving extracorporeal support for respiratory failure. ASAIO J. (2018)
  • Barbaro, R., et al. Development and validation of the pediatric risk estimate score for children using extracorporeal respiratory support (Ped-RESCUERS). Intensive Care Med. (2016)
  • Bedrick, E., et al. A new perspective on priors for generalized linear models. J. Amer. Statist. Assoc. (1996)
  • Box, G. A note on regions for tests of kurtosis. Biometrika (1953)
  • Box, G., et al. Bayesian Inference in Statistical Analysis (1992)
  • Carpenter, B. Stan: A probabilistic programming language. J. Statist. Software (2017)
  • Carvalho, C.M., et al. Handling sparsity via the horseshoe
  • Carvalho, C., et al. The horseshoe estimator for sparse signals. Biometrika (2010)
  • Clogg, C., et al. Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. J. Amer. Statist. Assoc. (1991)
  • Finney, D. The estimation from individual records of the relationship between dose and quantal response. Biometrika (1947)
  • Gelman, A., et al. Data Analysis Using Regression and Multilevel/Hierarchical Models (2007)
  • Gelman, A., et al. A weakly informative default prior distribution for logistic and other regression models. Ann. Appl. Stat. (2008)
  • Ghosh, J., et al. On the use of Cauchy prior distributions for Bayesian logistic regression. Bayesian Anal. (2017)
  • Greenland, S., et al. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Stat. Med. (2015)
  • Hanson, T., et al. Informative g-priors for logistic regression. Bayesian Anal. (2014)
  • Hastie, T., et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009)
  • Heinze, G., et al. A solution to the problem of separation in logistic regression. Stat. Med. (2002)