A new variant of the parallel regression model with variable selection in surveys with sensitive attribute

doi:10.1016/j.jspi.2020.08.006

Journal of Statistical Planning and Inference

Volume 212, May 2021, Pages 69-83

https://doi.org/10.1016/j.jspi.2020.08.006

Highlights

• Broadening the application scope of the non-randomized response techniques.
• Examining the relationship between the sensitive attribute and other covariate(s).
• Overcoming the limitation of logistic parallel regression model (Liu et al., 2019).
• Providing methods for confounding variable identification and variable selection.

Abstract

In this paper, a new hidden logistic regression model, namely the variant of the parallel regression model, is developed to study the relationship between a sensitive binary response variable and a set of non-sensitive covariates, where the information about the sensitive attribute of interest is collected via the variant of the parallel model originally proposed by Liu and Tian (2013b). An EM–NR algorithm is provided to derive the maximum likelihood estimates of the regression coefficients. We also discuss a method for identifying confounding variables and a method of variable selection based on the SCAD penalty for the variant of the parallel regression model. Finally, simulation studies and a real data example about premarital sexual behavior are presented to illustrate the proposed approaches.

Introduction

In practice, some investigations deal with phenomena that are regarded as violations of social morality or as illegal activities (such as bribery, tax evasion, illegitimate children, illegal immigration, drug abuse and so on). Such topics make interviewees feel embarrassed and refuse to provide truthful answers when they are asked directly. Thus, to better protect individuals’ privacy and to encourage truthful answers, methods for collecting sensitive information have developed rapidly during the last decades. To date, three main branches of techniques have been introduced for collecting and analyzing data in sample surveys with sensitive questions: the randomized response techniques (Warner, 1965, Horvitz et al., 1967, Greenberg et al., 1969, Fox and Tracy, 1986, Chaudhuri and Mukerjee, 1988, Mangat, 1994, Chaudhuri, 2011), the item count techniques or the unmatched count techniques (Miller, 1984, Dalton et al., 1994, Kuklinski et al., 1997, Gilens et al., 1998, LaBrie and Earleywine, 2000, Tsuchiya, 2005, Janus, 2010, Imai, 2011, Petróczi et al., 2011, Tian et al., 2017, Liu et al., 2019) and the non-randomized response techniques (Tian et al., 2007, Tian et al., 2009, Tian G.L et al., 2011, Yu et al., 2008, Tan et al., 2009, Tang et al., 2009, Liu and Tian, 2013a, Liu and Tian, 2013b, Groenitz, 2014, Tian, 2015).

Proposing new survey models that collect sensitive information with good privacy protection is not enough; researchers may also be interested in finding out which factors have a non-negligible influence on the sensitive attribute of interest. Thus, it is necessary to construct appropriate regression models, and the logit model is often a good choice for a binary response variable. For the randomized response approaches and the item count approaches, several works have established a connection between the binary sensitive response and other non-sensitive potential explanatory variables (see Maddala, 1983, Scheers and Dayton, 1988, Corstange, 2004, van den Hout et al., 2007, Hsieh et al., 2010). Since the non-randomized response methods offer advantages such as reproducibility, low cost, ease of understanding for both interviewers and interviewees, strong operability, better privacy protection and so forth, regression analysis under the non-randomized response models is worth studying. However, in the field of the non-randomized response approaches, only Tian et al. (2019) proposed a so-called hidden logit parallel model that takes non-sensitive covariates into consideration while the sensitive information is collected with Tian’s non-randomized parallel model (Tian, 2015). Nevertheless, the parallel model given by Tian requires the success proportions of the two auxiliary non-sensitive binary variables to be known before the experiment, which restricts the application range of this model. The variant of the parallel model put forward by Liu and Tian (2013b) overcomes this limitation by requiring only one of the success proportions of the two auxiliary non-sensitive variables to be known, while the other remains unknown. Therefore, the first objective of this paper is to develop a new variant of the parallel regression model to detect the factors that influence the sensitive attribute of interest.

To the best of our knowledge, little work has been done on confounding variable identification and variable selection for the analysis of surveys with sensitive characteristics, especially for non-randomized response approaches. However, in regression analysis, besides providing effective regression estimators, these are two important issues that deserve attention, and both help to enhance the accuracy and precision of the regression model. Regarding the former, since a confounder has an impact on both the explained variable and the explanatory variable(s), it may distort the observed relationship between the exposures and outcomes (Greenland et al., 1999, Pearl, 2009, VanderWeele and Shpitser, 2013). Thus, it is necessary to identify the confounding variable(s) in regression analysis so that the true association between dependent and independent variables can be revealed. Regarding the latter, variable selection helps to reduce the complexity of the model to a great extent, since a large number of covariates are usually chosen at the initial stage of modeling in order to reduce possible modeling biases. However, too many redundant variables go against the most basic modeling principle, i.e., the principle of parsimony; that is, the model used should require the smallest possible number of parameters that will adequately represent the data. Various variable selection techniques have been developed to help select significant factors by adding a penalty to the objective function (see Frank and Friedman, 1993, Tibshirani, 1996, Fan and Li, 2001, Zou and Hastie, 2005, Zou, 2006, Zhang, 2010, Wang and Wang, 2014, Wang and Wang, 2016). Fan and Li (2001) have pointed out that a good penalty function could help to obtain an estimator with the following three properties:

1. Unbiasedness: the resulting estimator is unbiased when the unknown coefficient of the regression is large;
2. Sparsity: the resulting estimator has an automatic thresholding rule that sets some estimated coefficients to zero in order to reduce the complexity of the model if the estimated values are small;
3. Continuity: the resulting estimator is a continuous function of data to avoid instability in model prediction.

Consequently, the second objective of this paper is to provide a method of identifying the confounding variable(s), and the third objective is to derive a variable selection technique for the proposed variant of the parallel regression model, for which the Smoothly Clipped Absolute Deviation (SCAD) method introduced by Fan and Li (2001) is adopted.
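To make the three properties concrete, the following is a minimal sketch of the SCAD penalty and its univariate thresholding rule as given by Fan and Li (2001). The function names `scad_penalty` and `scad_threshold`, the default `a = 3.7` (the value suggested by Fan and Li), and the orthonormal-design setting of the thresholding rule are assumptions made for illustration; this is not the estimation procedure of the proposed model.

```python
import numpy as np


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|), evaluated elementwise (requires a > 2)."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                           # |theta| <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))   # lam < |theta| <= a*lam
    const = lam**2 * (a + 1) / 2                               # |theta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))


def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule for one coefficient under an orthonormal design.

    z is the unpenalized (least squares) estimate of the coefficient.
    """
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    soft = np.sign(z) * np.maximum(absz - lam, 0.0)            # |z| <= 2*lam
    mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)       # 2*lam < |z| <= a*lam
    return np.where(absz <= 2 * lam, soft, np.where(absz <= a * lam, mid, z))


if __name__ == "__main__":
    lam = 1.0
    for z in (0.5, 1.5, 2.0, 3.0, 10.0):
        print(z, float(scad_threshold(z, lam)), float(scad_penalty(z, lam)))
```

In this sketch, small inputs are set exactly to zero (sparsity), inputs larger than `a * lam` are returned unchanged (unbiasedness for large coefficients), and the three pieces of the rule agree at `2 * lam` and `a * lam` (continuity).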

The rest of the paper is organized as follows. In Section 2, we first propose the variant of the parallel regression model and then derive an Expectation–Maximization algorithm embedded with Newton–Raphson (EM–NR) for calculating the MLEs of the regression coefficients. In Section 3, we provide the approach for identifying confounding variable(s). The method for variable selection based on the SCAD penalty is discussed in Section 4, where we also present the asymptotic properties of the penalized estimators. Four simulation studies are performed in Section 5 and a real data example about premarital sexual behavior is analyzed in Section 6 to illustrate the proposed methods. Finally, a discussion is given in Section 7.

Section snippets

The variant of the parallel regression model

Let $Q_{Y}$ be the sensitive question like “Have you ever taken subway without a ticket?” and $Y$ be a binary response variable corresponding to it, where $Y = 1$ denotes a “yes” answer to $Q_{Y}$ and $Y = 0$ otherwise. Let $\pi$ denote the unknown probability of an individual in certain population giving a positive response to $Q_{Y}$, i.e., $\pi = \Pr(Y = 1)$. Researchers are often interested in discovering the true value of $\pi$ or, even more, which attributes may have influences on it. Note that the outcome of the sensitive
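As a reminder of what a logistic link looks like in this setting, the standard form relating the probability of a positive response to covariates can be written as below. The covariate vector $\mathbf{x}_i$, the coefficient vector $\boldsymbol{\beta}$ and the subject index $i$ are notation assumed here for illustration; the paper's own notation may differ.

$$
\pi_i = \Pr(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})},
\qquad \text{equivalently} \qquad
\log\frac{\pi_i}{1 - \pi_i} = \mathbf{x}_i^{\top}\boldsymbol{\beta}.
$$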

Identification of the confounding variables

When constructing a regression model, a potential problem that deserves our attention is that an extra variable, which is neither the independent nor the dependent variable of interest, may be ignored, and this missing element will distort the real relationship between the exposure and outcome; that is, a causal relationship is suggested when in fact there is none. For example, a study may incorrectly build a causal relationship between domestic violence and family fortunes because it ignores the effect
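The distortion caused by an omitted confounder can be illustrated with a small simulation. The sketch below is a generic illustration using ordinary least squares on synthetic continuous data, not the identification method of this paper; the variable names, sample size and data-generating model are all assumptions. The outcome `y` depends only on the confounder `z`, yet a regression of `y` on `x` alone reports a clear association.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

z = rng.normal(size=n)        # confounder
x = z + rng.normal(size=n)    # exposure: driven by z
y = z + rng.normal(size=n)    # outcome: driven by z only, x has no effect


def ols(design, response):
    """Least squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


ones = np.ones(n)
# Slope of x when the confounder is ignored (theoretical value 0.5 here).
naive = ols(np.column_stack([ones, x]), y)[1]
# Slope of x after adjusting for the confounder (theoretical value 0).
adjusted = ols(np.column_stack([ones, x, z]), y)[1]

print(f"slope of x ignoring z:      {naive:.3f}")
print(f"slope of x adjusting for z: {adjusted:.3f}")
```

Under this data-generating model the unadjusted slope converges to cov(x, y) / var(x) = 1/2, while the adjusted slope converges to zero, which is the kind of spurious relationship described above.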

Penalized maximum likelihood

Another essential issue when constructing a regression model is how to select the important independent variables from many potential factors at the initial stage of modeling and reduce the complexity of the model to a great extent. Although some irrelevant or weakly correlated covariates included in the model may make the forecasting effect a bit better, they will certainly increase the modeling cost, decrease the precision of parameter estimation and reduce the accuracy of the model. And too many
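For orientation, a penalized likelihood criterion in the general form used by Fan and Li (2001) is shown below. Here $\ell(\boldsymbol{\beta})$ denotes a log-likelihood, $n$ the sample size, $p$ the number of coefficients and $\lambda$ a tuning parameter; this notation is assumed for illustration, and the exact objective function used for the variant of the parallel regression model may differ.

$$
Q(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - n \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),
\qquad
p_{\lambda}'(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_{+}}{(a - 1)\lambda}\, I(\theta > \lambda) \right\}, \quad \theta > 0,\ a > 2.
$$

The second expression is the derivative of the SCAD penalty: it equals $\lambda$ for small coefficients, as with the lasso, and decays to zero once $\theta > a\lambda$, so large coefficients are not shrunk.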

Simulation studies

In this section, four simulation studies are performed to illustrate the proposed methods. In the first experiment, a two-regressor equation is employed to examine the estimation performance of our proposed parameter estimation approach for the variant of the parallel regression model. The second experiment is conducted to compare the results based on the logistic assumption and the normal assumption, respectively, to show the rationality of our model choice. In the third experiment, a

A real example of premarital sexual behavior in Wuhan

Talking about personal sexual practice is still a sensitive topic in mainland China. As a consequence, evaluating the proportion of premarital sexual behavior in a certain population via direct questionnaires is not an easy job, because people are likely to refuse to cooperate since the issue would make them feel embarrassed. To examine which factors may contribute to the proportion of premarital sexual activity as well as to illustrate our proposed method, a small survey was conducted in

Discussion

In this paper, we provided a new hidden logit regression model to study the relationship between a sensitive binary response variable and a set of non-sensitive covariates, where the sensitive data are collected through the variant of the parallel model originally proposed by Liu and Tian (2013b). This new non-randomized regression model is useful for overcoming the limitations of randomized response techniques as well as having a wider range of application than the hidden

CRediT authorship contribution statement

Yin Liu: Conceptualization, Methodology, Software, Simulation, Writing - original draft, Investigation. Guo-Liang Tian: Methodology, Writing - review & editing. Mingqiu Wang: Methodology, Software, Simulation, Writing - review & editing.

Acknowledgments

The authors would like to thank the Executive Editor and the referee for their helpful comments and suggestions, which resulted in a significant improvement of the manuscript. Y LIU’s research was fully supported by grants (11601524 & 61773401) from National Natural Science Foundation of China. GL TIAN’s research was fully supported by a grant (11771199) from National Natural Science Foundation of China. MQ WANG’s research was supported by the National Natural Science Foundation of China (

References (49)

Horrace, W.C. et al. Results on the bias and inconsistency of ordinary least squares for the linear probability model. Econom. Lett. (2006)
Hsieh, S.H. et al. Logistic regression analysis of randomized response data with missing covariates. J. Statist. Plann. Inference (2010)
Liu, Y. et al. A variant of the parallel model for sample surveys with sensitive characteristics. Comput. Statist. Data Anal. (2013)
Tang, M.L. et al. A new non-randomized multi-category response model for surveys with a single sensitive question: Design and analysis. J. Korean Stat. Soc. (2009)
van den Hout, A. et al. The logistic regression model with response variables subject to randomized response. Comput. Statist. Data Anal. (2007)
Wang, M. et al. Adaptive lasso estimators for ultrahigh dimensional generalized linear models. Statist. Probab. Lett. (2014)
Chaudhuri, A. Randomized Response and Indirect Questioning Techniques in Surveys (2011)
Chaudhuri, A. et al. Randomized Response: Theory and Techniques (1988)
Corstange, D., 2004. Sensitive questions, truthful responses? Randomized response and hidden logit as a procedure to...
Dalton, D.R. et al. Using the unmatched count technique (UCT) to estimate base-rates for sensitive behavior. Pers. Psychol. (1994)
Fan, J. et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
Fan, J. et al. Nonconcave penalized likelihood with diverging number of parameters. Ann. Statist. (2004)
Fox, J.A. et al. Randomized Response: A Method for Sensitive Surveys (Series: Quantitative Applications in the Social Sciences) (1986)
Frank, I.E. et al. A statistical view of some chemometrics regression tools. Technometrics (1993)
Gilens, M. et al. Affirmative action and the politics of realignment. British J. Political Sci. (1998)
Greenberg, B.G. et al. The unrelated question randomized response model: Theoretical framework. J. Amer. Statist. Assoc. (1969)
Greenland, S. et al. Confounding and Collapsibility in Causal Inference. Statist. Sci. (1999)
Groenitz, H. A new privacy-protecting survey design for multichotomous sensitive variables. Metrika (2014)
Hennekens, C.H. et al. Epidemiology in Medicine (1987)
Horvitz, D.G. et al. The unrelated question randomized response model
Imai, K. Multivariate regression analysis for the item count technique. J. Amer. Statist. Assoc. (2011)
Janus, A.L. The influence of social desirability pressures on expressed immigration attitudes. Soc. Sci. Quart. (2010)
Kuklinski, J.H. et al. Racial attitudes and the new south. J. Politics (1997)
LaBrie, J.W. et al. Sexual risk behaviors and alcohol: Higher base rates revealed using the unmatched-count technique. J. Sex Res. (2000)
