Abstract
In this paper, we tackle the problem of splitting a long (and potentially time-consuming) questionnaire into two parts, such that each participant responds to only a fraction of the questions while all respondents receive a common set of questions. We propose a method that combines regression models fitted to the two independent samples (questionnaires) of the survey. Each sample includes the common response variable Y and the common covariate x, while two vectors of sample-specific covariates, z and w, are recorded such that no single sampling unit has answered both z and w. This corresponds to the problem of statistical matching, which we tackle under the assumption of conditional independence. In the statistical matching context, we use a macro approach to estimate the parameters of a regression model; that is, we estimate the joint distribution of all variables of interest from the available data by exploiting the conditional independence assumption. We make use of this by fitting three regression models, each with the same response variable. Combining the three models yields a prediction model that includes all covariates. We evaluate the performance of the proposed method in simulation studies as well as in a real data example. It gives better results than commonly used alternative methods. The proposed routine is easy to apply in practice and requires neither the formulation of a model for the covariates themselves nor an imputation model for the missing covariate vectors z and w.
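The idea of combining three regression models can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes linear models fitted by ordinary least squares, scalar x, z and w, and simulated data in which z and w are independent given x. Under these assumptions the combined predictor can be written as the prediction from the (x, z) model plus the prediction from the (x, w) model minus the prediction from the x-only model. The data-generating coefficients, the helper `fit_ols`, and the function `predict_combined` are all hypothetical names and values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_ols(X, y):
    """Least-squares fit with intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def simulate(n):
    """Hypothetical data: z and w depend on x but are independent given x."""
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)
    w = 0.8 * x + rng.normal(size=n)
    y = 1.0 + 2.0 * x + 3.0 * z - 1.5 * w + rng.normal(scale=0.5, size=n)
    return y, x, z, w


n = 20000
# Sample S_a records (Y, x, z); sample S_b records (Y, x, w).
ya, xa, za, _ = simulate(n)
yb, xb, _, wb = simulate(n)

# Three models with the same response variable Y.
coef_a = fit_ols(np.column_stack([xa, za]), ya)   # Y ~ x + z on S_a
coef_b = fit_ols(np.column_stack([xb, wb]), yb)   # Y ~ x + w on S_b
coef_0 = fit_ols(np.concatenate([xa, xb]),        # Y ~ x on both samples
                 np.concatenate([ya, yb]))

# Assumed combination rule: f_a(x, z) + f_b(x, w) - f_0(x).
combined = np.array([
    coef_a[0] + coef_b[0] - coef_0[0],  # intercept
    coef_a[1] + coef_b[1] - coef_0[1],  # coefficient of x
    coef_a[2],                          # coefficient of z
    coef_b[2],                          # coefficient of w
])


def predict_combined(x, z, w):
    """Predict Y from all covariates using the combined coefficients."""
    return combined[0] + combined[1] * x + combined[2] * z + combined[3] * w


print(np.round(combined, 2))
```

In this simulation the combined coefficients should lie close to the values used to generate the data (1, 2, 3, -1.5), although no unit was observed with both z and w. The sketch relies on the conditional independence of z and w given x; if that assumption fails, the subtraction step no longer recovers the full model.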
Appendix: Variables list for rent data example
Common variables Y, x | Component 1 z | Component 2 w | Sample |
---|---|---|---|
Y = rent per square meter (in euros); x = floor space | z1 = 1 if the apartment does not have an upmarket kitchen; z2 = 1 if the apartment has an open kitchen; z3 = 1 if the apartment is in an apartment-type building; z4 = 1 if the apartment is in an old building; z5 = 1 if the apartment is located in back premises; z6 = 1 if the apartment has standard central heating; z7 = 1 if the apartment has underfloor heating | Missing | \(S_a\) |
Y and x as above | Missing | w1 = 1 if the apartment has good bathroom equipment; w2 = 1 if the apartment is in an average residential location; w3 = 1 if the apartment has a second restroom; w4 = 1 if the apartment has a new floor; w5 = 1 if the apartment has a bad floor; w6 = 1 if the apartment has a good floor; w7 = 1 if the apartment is on the ground floor | \(S_b\) |
Cite this article
Ali, M., Kauermann, G. A split questionnaire survey design in the context of statistical matching. Stat Methods Appl 30, 1219–1236 (2021). https://doi.org/10.1007/s10260-020-00554-2
DOI: https://doi.org/10.1007/s10260-020-00554-2