Abstract
This paper studies nonparametric estimation from a large sample subject to sampling bias. The general parameter considered is the mean of a transformation of the random variable of interest. When the biasing weight function is unknown, a small simple random sample from the real population is assumed to be additionally observed. A new nonparametric estimator that incorporates kernel density estimation is proposed. Asymptotic properties of this estimator are obtained under suitable limit conditions on the small and large sample sizes and under standard and non-standard asymptotic conditions on the two bandwidths. Explicit formulas are given for the particular case of mean estimation. Simulation results show that the new mean estimator outperforms two classical ones for suitable choices of the two smoothing parameters involved. The influence of the two smoothing parameters on the performance of the final estimator is also studied, exhibiting a striking limit behavior of their optimal values. The new method is applied to a real data set from the telecommunications company Vodafone ES, where a bootstrap algorithm is used to select the smoothing parameter.
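To make the setting concrete, the following is a minimal sketch and not the authors' implementation. It assumes the estimator takes a ratio form, averaging \(v(Y_i)\,{\hat{f}}_h(Y_i)/{\hat{g}}_b(Y_i)\) over the large biased sample, with \({\hat{f}}_h\) built from the small simple random sample and \({\hat{g}}_b\) from the large one. The Gaussian kernel, the length-biased Gamma simulation, the function names and the bandwidth values are all illustrative assumptions; in particular, the bandwidths are fixed by hand rather than chosen by a bootstrap.

```python
import numpy as np


def gaussian_kde_at(points, sample, bandwidth):
    """Gaussian kernel density estimate built from `sample`, evaluated at `points`."""
    points = np.asarray(points, dtype=float)
    sample = np.asarray(sample, dtype=float)
    z = (points[:, None] - sample[None, :]) / bandwidth
    norm = sample.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def debiased_mean(big_biased, small_srs, h, b, v=lambda y: y):
    """Assumed ratio form: mean of v(Y_i) * f_hat_h(Y_i) / g_hat_b(Y_i)."""
    big_biased = np.asarray(big_biased, dtype=float)
    f_hat = gaussian_kde_at(big_biased, small_srs, h)   # density of X, small sample
    g_hat = gaussian_kde_at(big_biased, big_biased, b)  # density of Y, large sample
    return float(np.mean(v(big_biased) * f_hat / g_hat))


# Hypothetical simulation: X ~ Gamma(2, 1) has mean 2. Under length bias
# (weight w(y) = y) the observed variable Y follows Gamma(3, 1), with mean 3.
rng = np.random.default_rng(0)
small_srs = rng.gamma(shape=2.0, scale=1.0, size=200)    # small unbiased sample
big_biased = rng.gamma(shape=3.0, scale=1.0, size=2000)  # large biased sample

naive = float(big_biased.mean())  # ignores the sampling bias
estimate = debiased_mean(big_biased, small_srs, h=0.5, b=0.25)  # ad hoc bandwidths
print(f"naive mean of biased sample: {naive:.3f}")
print(f"ratio-type estimate:         {estimate:.3f}")
```

In this toy setting the naive mean of the large sample targets 3 rather than 2, while the ratio-type average is reweighted toward the distribution of the small sample. The pairwise kernel matrices need memory proportional to the square of the large sample size, so this sketch is only suitable for moderate sizes.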
Acknowledgements
This research has been supported by MINECO Grant MTM2017-82724-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and ED431C-2020-14, Centro Singular de Investigación de Galicia ED431G/01 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the European Regional Development Fund (ERDF). The authors would like to thank Vodafone ES for their collaboration and especially Alberto de Santos and the Big Data and Analytics team of this company for suggesting this problem and providing the data set used in this paper. The authors would like to thank two anonymous reviewers for their valuable comments and suggestions which led to an improved version of this paper.
Supplementary Information
The online supplementary material (11749_2020_749_MOESM1_ESM.pdf) contains the proofs of Lemma 1, Subsection 2.3, Theorem 1 and Theorem 2.
Sketch of the proof
Only sketches of the proofs of Theorems 1 and 2 are included here.
1.1 Sketch of the proof of Theorem 1
Let us first state an auxiliary lemma, whose proof can be found in the supplementary materials.
Lemma 2
The difference \({\hat{\mu }}_v - \mu _v\) can be expressed as follows:
where
and
The term in (17) can be split into several terms, \({\widehat{A}} ={\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3 - {\widehat{A}}_4 + {\widehat{A}}_5,\) where
Since the summands of \({\widehat{A}}_4\) and \({\widehat{A}}_5\) contain quadratic factors (namely \(({\hat{f}}_h(Y_i)-f(Y_i)) ( {\hat{g}}_b(Y_i)-g(Y_i) )\) and \( ( {\hat{g}}_b(Y_i)-g(Y_i) )^2\), respectively), these terms are negligible with respect to the others. Consequently, \({\widehat{A}} \simeq {\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3.\)
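As a hedged illustration of why the quadratic factors are of smaller order, assume (this form is not displayed in the sketch here) that the estimator averages \(v(Y_i)\,{\hat{f}}_h(Y_i)/{\hat{g}}_b(Y_i)\) over the large sample. Writing \({\hat{f}} = {\hat{f}}_h(y)\), \({\hat{g}} = {\hat{g}}_b(y)\), \(f = f(y)\) and \(g = g(y)\) at a fixed point \(y\) with \(g(y) > 0\) and \({\hat{g}}_b(y) > 0\), the following exact algebraic identity separates the linear from the quadratic contributions:

\[
\frac{{\hat{f}}}{{\hat{g}}} - \frac{f}{g}
= \frac{{\hat{f}} - f}{g}
- \frac{f\,({\hat{g}} - g)}{g^2}
- \frac{({\hat{f}} - f)({\hat{g}} - g)}{g^2}
+ \frac{{\hat{f}}\,({\hat{g}} - g)^2}{g^2\,{\hat{g}}}.
\]

The first two terms on the right-hand side are linear in a single estimation error, whereas the last two are products of two estimation errors, so they are of smaller order whenever both errors tend to zero. The sign pattern of the four terms agrees with that of \({\widehat{A}}_2 - {\widehat{A}}_3 - {\widehat{A}}_4 + {\widehat{A}}_5\), but the exact definitions of these terms are those of the paper and its supplementary materials, not this illustration.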
Lemma 3
The expectation and variance of \({\widehat{A}}\) can be approximated by
The proof of Theorem 1 proceeds by analyzing the expectations and variances involved. It is available in the supplementary materials.
1.2 Sketch of the proof of Theorem 2
Using Lemma 2, in this case \({\widehat{A}}\) can be expressed as: \({\widehat{A}} ={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^* - {\widehat{A}}_4^* + {\widehat{A}}_5^*,\) where
with \({\widehat{A}}_4^*\) and \({\widehat{A}}_5^*\) being negligible terms. Thus, we consider \({\widehat{A}}^* :={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^*. \)
The proof of Theorem 2 parallels that of Theorem 1. It is available in the supplementary materials.
Cite this article
Borrajo, L., Cao, R. Nonparametric estimation for big-but-biased data. TEST 30, 861–883 (2021). https://doi.org/10.1007/s11749-020-00749-5
DOI: https://doi.org/10.1007/s11749-020-00749-5