Nonparametric estimation for big-but-biased data

Borrajo, Laura; Cao, Ricardo

doi:10.1007/s11749-020-00749-5

Original Paper
Published: 26 January 2021

Volume 30, pages 861–883, (2021)


Abstract

Nonparametric estimation for a large-sized sample subject to sampling bias is studied in this paper. The general parameter considered is the mean of a transformation of the random variable of interest. When the biasing weight function is unknown, a small-sized simple random sample of the real population is assumed to be additionally observed. A new nonparametric estimator that incorporates kernel density estimation is proposed. Asymptotic properties for this estimator are obtained under suitable limit conditions on the small and the large sample sizes and under standard and non-standard asymptotic conditions on the two bandwidths. Explicit formulas are shown for the particular case of mean estimation. Simulation results show that the new mean estimator outperforms two classical ones for suitable choices of the two smoothing parameters involved. The influence of the two smoothing parameters on the performance of the final estimator is also studied, exhibiting a striking limit behavior of their optimal values. The new method is applied to a real data set from the telecommunications company Vodafone ES, where a bootstrap algorithm is used to select the smoothing parameter.



Acknowledgements

This research has been supported by MINECO Grant MTM2017-82724-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and ED431C-2020-14, Centro Singular de Investigación de Galicia ED431G/01 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the European Regional Development Fund (ERDF). The authors would like to thank Vodafone ES for their collaboration and especially Alberto de Santos and the Big Data and Analytics team of this company for suggesting this problem and providing the data set used in this paper. The authors would like to thank two anonymous reviewers for their valuable comments and suggestions which led to an improved version of this paper.

Author information

Authors and Affiliations

Research Group MODES, Department of Mathematics, CITIC, University of A Coruña, 15071, A Coruña, Spain
Laura Borrajo
Research Group MODES, Department of Mathematics, CITIC and ITMATI, University of A Coruña, 15071, A Coruña, Spain
Ricardo Cao

Corresponding author

Correspondence to Laura Borrajo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

11749_2020_749_MOESM1_ESM.pdf

Supplementary material: The online supplementary material contains the proofs of Lemma 1, Subsection 2.3, Theorem 1 and Theorem 2.

Sketch of the proof

Only sketches of the proofs of Theorems 1 and 2 are included here.

1.1 Sketch of the proof of Theorem 1

Let us first state an auxiliary lemma, whose proof can be found in the supplementary materials.

Lemma 2

The difference ${\hat{\mu }}_v - \mu _v$ can be expressed as follows:

$$\begin{aligned} {\hat{\mu }}_v - \mu _v={\widehat{A}} +{\widehat{A}}\left( 1-{\widehat{B}}\right) +\dfrac{{\widehat{A}}\left( 1-{\widehat{B}}\right) ^2}{{\widehat{B}}}\simeq {\widehat{A}}, \end{aligned}$$

(16)

where

$$\begin{aligned} {\widehat{A}}=\dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}(v(Y_i)-\mu _v)} \end{aligned}$$

(17)

and

$$\begin{aligned} {\widehat{B}}=\dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}}. \end{aligned}$$

(18)
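The identity in (16) can be checked by hand. The short derivation below is a sketch that assumes the estimator is the self-normalized weighted mean $\hat{\mu}_v = \sum_{i} w_i\, v(Y_i) / \sum_{i} w_i$ with $w_i = \hat{f}_h(Y_i)/\hat{g}_b(Y_i)$; this form is inferred from (16)–(18), since the definition of $\hat{\mu}_v$ is not reproduced in this appendix. The symbol $w_i$ is introduced here only for illustration.

$$\begin{aligned} {\hat{\mu }}_v - \mu _v&= \dfrac{\frac{1}{N}\sum _{i=1}^{N} w_i\,(v(Y_i)-\mu _v)}{\frac{1}{N}\sum _{i=1}^{N} w_i} = \dfrac{{\widehat{A}}}{{\widehat{B}}}, \\ \dfrac{1}{{\widehat{B}}}&= 1 + \left( 1-{\widehat{B}}\right) + \dfrac{\left( 1-{\widehat{B}}\right) ^2}{{\widehat{B}}}, \quad \text {because}\quad {\widehat{B}} + {\widehat{B}}\left( 1-{\widehat{B}}\right) + \left( 1-{\widehat{B}}\right) ^2 = 1. \end{aligned}$$

Multiplying the second line by ${\widehat{A}}$ gives the three terms in (16). The approximation by ${\widehat{A}}$ alone is then plausible whenever ${\widehat{B}}$ is close to 1, since the remaining two terms carry factors $1-{\widehat{B}}$ and $(1-{\widehat{B}})^2$.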

The term in (17) can be split into several terms, ${\widehat{A}} ={\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3 - {\widehat{A}}_4 + {\widehat{A}}_5,$ where

$$\begin{aligned} {\widehat{A}}_1:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{f(Y_i)}{g(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_2:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)-f(Y_i)}{g(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_3:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{f(Y_i) ( {\hat{g}}_b(Y_i)-g(Y_i) )}{g(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_4:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{({\hat{f}}_h(Y_i)-f(Y_i)) ( {\hat{g}}_b(Y_i)-g(Y_i) )}{g(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_5:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}\left( \dfrac{{\hat{g}}_b(Y_i)-g(Y_i)}{g(Y_i)}\right) ^2(v(Y_i)-\mu _v)}. \end{aligned}$$

Since the terms ${\widehat{A}}_4$ and ${\widehat{A}}_5$ contain factors of quadratic order inside the sum (i.e., $({\hat{f}}_h(Y_i)-f(Y_i)) ( {\hat{g}}_b(Y_i)-g(Y_i) )$ and $ ( {\hat{g}}_b(Y_i)-g(Y_i) )^2$), they are negligible with respect to the other terms. Consequently, ${\widehat{A}} \simeq {\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3.$
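The five-term split is an exact algebraic identity, obtained by writing ${\hat{f}}_h = f + ({\hat{f}}_h - f)$ and expanding $1/{\hat{g}}_b$ around $1/g$ with an exact remainder. A minimal numerical check in Python is sketched below. The arrays standing in for $f$, $g$, ${\hat{f}}_h$ and ${\hat{g}}_b$ at the points $Y_i$ are synthetic placeholders, not kernel estimates, and the function name `decompose_A` is hypothetical.

```python
import numpy as np

def decompose_A(f, g, f_hat, g_hat, v, mu_v):
    """Return A from (17) and the five terms A1..A5, given values at the points Y_i."""
    c = v - mu_v                                   # centred values v(Y_i) - mu_v
    A = np.mean(f_hat / g_hat * c)                 # equation (17)
    A1 = np.mean(f / g * c)
    A2 = np.mean((f_hat - f) / g * c)
    A3 = np.mean(f * (g_hat - g) / g**2 * c)
    A4 = np.mean((f_hat - f) * (g_hat - g) / g**2 * c)
    A5 = np.mean(f_hat / g_hat * ((g_hat - g) / g) ** 2 * c)
    return A, (A1, A2, A3, A4, A5)

# Synthetic placeholder values (assumption: any positive arrays will do).
rng = np.random.default_rng(0)
N = 1000
f = rng.uniform(0.5, 1.5, N)
g = rng.uniform(0.5, 1.5, N)
f_hat = f * (1 + 0.05 * rng.standard_normal(N))   # small perturbation of f
g_hat = g * (1 + 0.05 * rng.standard_normal(N))   # small perturbation of g
v = rng.standard_normal(N)

A, (A1, A2, A3, A4, A5) = decompose_A(f, g, f_hat, g_hat, v, mu_v=0.0)
recombined = A1 + A2 - A3 - A4 + A5
print(A, recombined)
```

The two printed numbers should agree up to floating-point rounding, whatever the perturbation size; only the claim that ${\widehat{A}}_4$ and ${\widehat{A}}_5$ are negligible depends on the estimation errors being small.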

Lemma 3

The expectation and variance of ${\widehat{A}}$ can be approximated by

$$\begin{aligned} E\left( {\widehat{A}}\right)\simeq & {} E\left( {\widehat{A}}_1\right) +E\left( {\widehat{A}}_2\right) - E\left( {\widehat{A}}_3\right) , \end{aligned}$$

(19)

$$\begin{aligned} Var\left( {\widehat{A}}\right)\simeq & {} Var\left( {\widehat{A}}_1\right) +Var\left( {\widehat{A}}_2\right) + Var\left( {\widehat{A}}_3\right) \nonumber \\+ & {} 2 Cov\left( {\widehat{A}}_1, {\widehat{A}}_2\right) -2Cov\left( {\widehat{A}}_1, {\widehat{A}}_3\right) - 2Cov\left( {\widehat{A}}_2, {\widehat{A}}_3\right) . \end{aligned}$$

(20)

The proof of Theorem 1 proceeds by analyzing the expectations and variances involved. It is available in the supplementary materials.
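The quantities in Lemma 2 can be computed directly from data. The Python sketch below assumes that the estimator is the self-normalized weighted mean implied by (16)–(18), that both density estimates use a Gaussian kernel, and that the bandwidths `h` and `b` are fixed by hand; none of these choices is confirmed by this appendix, and the function names are hypothetical. The simulated design (true density $N(0,1)$, biased density $N(1,1)$, which corresponds to a biasing weight proportional to $e^{x}$) is also only an illustration.

```python
import numpy as np

def gaussian_kde_at(points, sample, bandwidth):
    """Gaussian-kernel density estimate built from `sample`, evaluated at `points`."""
    u = (points[:, None] - sample[None, :]) / bandwidth
    kernel_values = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel_values.mean(axis=1) / bandwidth

def weighted_mean_estimate(x_srs, y_biased, v, h, b):
    """Self-normalized mean of v(Y_i) with weights f_hat_h(Y_i) / g_hat_b(Y_i)."""
    f_hat = gaussian_kde_at(y_biased, x_srs, h)      # from the small simple random sample
    g_hat = gaussian_kde_at(y_biased, y_biased, b)   # from the large biased sample
    weights = f_hat / g_hat
    return np.sum(weights * v(y_biased)) / np.sum(weights)

rng = np.random.default_rng(2021)
n, N = 100, 2000                       # small unbiased and large biased sample sizes
x_srs = rng.normal(0.0, 1.0, n)        # simple random sample from f = N(0, 1)
y_biased = rng.normal(1.0, 1.0, N)     # biased sample from g = N(1, 1)
h, b = 0.5, 0.2                        # illustrative bandwidths, not tuned

estimate = weighted_mean_estimate(x_srs, y_biased, lambda t: t, h, b)
naive = y_biased.mean()                # ignores the sampling bias
print(estimate, naive)
```

In this design the true mean is 0, so the naive mean of the biased sample is off by about 1, while the weighted estimate should land much closer to 0. The pairwise kernel matrix costs memory of order $N^2$, so a genuinely large sample would need binning or batching.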

1.2 Sketch of the proof of Theorem 2

Using Lemma 2, in this case ${\widehat{A}}$ can be expressed as: ${\widehat{A}} ={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^* - {\widehat{A}}_4^* + {\widehat{A}}_5^*,$ where

$$\begin{aligned} {\widehat{A}}_1^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{(K_h*f)(Y_i)}{(K_b*g)(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_2^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)-(K_h*f)(Y_i)}{(K_b*g)(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_3^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{(K_h*f)(Y_i) ( {\hat{g}}_b(Y_i)-(K_b*g)(Y_i) )}{(K_b*g)(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_4^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{({\hat{f}}_h(Y_i)-(K_h*f)(Y_i)) ( {\hat{g}}_b(Y_i)-(K_b*g)(Y_i) )}{(K_b*g)(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_5^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}\left( \dfrac{{\hat{g}}_b(Y_i)-(K_b*g)(Y_i)}{(K_b*g)(Y_i)}\right) ^2(v(Y_i)-\mu _v)}, \end{aligned}$$

where ${\widehat{A}}_4^*$ and ${\widehat{A}}_5^*$ are negligible terms. Thus we will consider ${\widehat{A}}^* :={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^*.$

The proof of Theorem 2 follows the same lines as that of Theorem 1. It is available in the supplementary materials.
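The starred terms are centred at the smoothed densities $K_h*f$ and $K_b*g$ rather than at $f$ and $g$. A standard property of a kernel density estimator built from an i.i.d. sample is $E\left( {\hat{f}}_h(y)\right) = (K_h*f)(y)$, so centring there removes the smoothing bias from the error terms; that this is the motivation in the paper is an assumption. The Python sketch below evaluates the convolution numerically for an illustrative case (Gaussian kernel, $f = N(0,1)$), where it has the closed form $N(0, 1+h^2)$. The function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def smoothed_density(y, h, grid):
    """(K_h * f)(y) as a Riemann sum over `grid`, with Gaussian K and f = N(0, 1)."""
    dx = grid[1] - grid[0]
    # K_h(u) = K(u / h) / h is the N(0, h^2) density when K is standard Gaussian.
    return np.sum(normal_pdf(y - grid, sd=h) * normal_pdf(grid)) * dx

grid = np.linspace(-10.0, 10.0, 20001)
h, y = 0.5, 0.7                                   # illustrative bandwidth and point
numeric = smoothed_density(y, h, grid)
closed_form = normal_pdf(y, sd=np.sqrt(1.0 + h**2))
print(numeric, closed_form)
```

The two values should agree to many digits. When $h$ does not tend to zero, $K_h*f$ stays different from $f$, which is why the two decompositions are not interchangeable.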

About this article

Cite this article

Borrajo, L., Cao, R. Nonparametric estimation for big-but-biased data. TEST 30, 861–883 (2021). https://doi.org/10.1007/s11749-020-00749-5

Received: 01 November 2019
Accepted: 12 December 2020
Published: 26 January 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s11749-020-00749-5
