
Nonparametric estimation for big-but-biased data


Abstract

Nonparametric estimation for a large-sized sample subject to sampling bias is studied in this paper. The general parameter considered is the mean of a transformation of the random variable of interest. Since the biasing weight function is assumed to be unknown, a small-sized simple random sample of the real population is assumed to be additionally observed. A new nonparametric estimator that incorporates kernel density estimation is proposed. Asymptotic properties for this estimator are obtained under suitable limit conditions on the small and the large sample sizes and under standard and non-standard asymptotic conditions on the two bandwidths. Explicit formulas are shown for the particular case of mean estimation. Simulation results show that the new mean estimator outperforms two classical ones for suitable choices of the two smoothing parameters involved. The influence of the two smoothing parameters on the performance of the final estimator is also studied, exhibiting a striking limit behavior of their optimal values. The new method is applied to a real data set from the telecommunications company Vodafone ES, where a bootstrap algorithm is used to select the smoothing parameter.
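To fix ideas, the following is a minimal sketch of the two-sample setting just described, under purely illustrative assumptions: the real population is N(0, 1), the biasing weight is w(y) = exp(y) (so the biased density is that of N(1, 1)), and the sample sizes are arbitrary. None of this is the simulation design used in the paper; it only shows why the naive mean of the large biased sample is misleading, while the small simple random sample is unbiased but noisy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (hypothetical) setting: real population X ~ N(0, 1),
# biasing weight w(y) = exp(y).  Exponential tilting of N(0, 1) gives
# N(1, 1), so the big biased sample can be drawn from N(1, 1) directly.
N, n = 100_000, 100                          # big biased sample size, small SRS size

Y = rng.normal(loc=1.0, scale=1.0, size=N)   # big-but-biased sample
X = rng.normal(loc=0.0, scale=1.0, size=n)   # small simple random sample

print("naive mean of the big biased sample:", Y.mean())   # close to 1: badly biased
print("mean of the small SRS:", X.mean())                  # unbiased but noisy
print("true population mean: 0.0")
```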



Acknowledgements

This research has been supported by MINECO Grant MTM2017-82724-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and ED431C-2020-14, Centro Singular de Investigación de Galicia ED431G/01 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the European Regional Development Fund (ERDF). The authors would like to thank Vodafone ES for their collaboration and especially Alberto de Santos and the Big Data and Analytics team of this company for suggesting this problem and providing the data set used in this paper. The authors would like to thank two anonymous reviewers for their valuable comments and suggestions which led to an improved version of this paper.

Author information


Corresponding author

Correspondence to Laura Borrajo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

11749_2020_749_MOESM1_ESM.pdf

Supplementary material: The online supplementary material contains the proofs of Lemma 1, the results in Subsection 2.3, Theorem 1 and Theorem 2.

Sketch of the proof

Only the sketches of the proofs of Theorems 1 and 2 are included here.

1.1 Sketch of the proof of Theorem 1

Let us first state an auxiliary lemma, whose proof can be found in the supplementary materials.

Lemma 2

The difference \({\hat{\mu }}_v - \mu _v\) can be expressed as follows:

$$\begin{aligned} {\hat{\mu }}_v - \mu _v={\widehat{A}} +{\widehat{A}}\left( 1-{\widehat{B}}\right) +\dfrac{{\widehat{A}}\left( 1-{\widehat{B}}\right) ^2}{{\widehat{B}}}\simeq {\widehat{A}}, \end{aligned}$$
(16)

where

$$\begin{aligned} {\widehat{A}}=\dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}(v(Y_i)-\mu _v)} \end{aligned}$$
(17)

and

$$\begin{aligned} {\widehat{B}}=\dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}}. \end{aligned}$$
(18)

The term in (17) can be split into several terms, \({\widehat{A}} ={\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3 - {\widehat{A}}_4 + {\widehat{A}}_5,\) where

$$\begin{aligned} {\widehat{A}}_1:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{f(Y_i)}{g(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_2:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)-f(Y_i)}{g(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_3:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{f(Y_i) ( {\hat{g}}_b(Y_i)-g(Y_i) )}{g(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_4:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{({\hat{f}}_h(Y_i)-f(Y_i)) ( {\hat{g}}_b(Y_i)-g(Y_i) )}{g(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_5:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}\left( \dfrac{{\hat{g}}_b(Y_i)-g(Y_i)}{g(Y_i)}\right) ^2(v(Y_i)-\mu _v)}. \end{aligned}$$

Since the terms \({\widehat{A}}_4\) and \({\widehat{A}}_5\) contain factors of quadratic order inside the sum (namely, \(({\hat{f}}_h(Y_i)-f(Y_i)) ( {\hat{g}}_b(Y_i)-g(Y_i) )\) and \( ( {\hat{g}}_b(Y_i)-g(Y_i) )^2\)), they are negligible with respect to the other terms. Consequently, \({\widehat{A}} \simeq {\widehat{A}}_1 +{\widehat{A}}_2 - {\widehat{A}}_3.\)
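To illustrate the quantities in (17) and (18), note that the identity in (16) amounts to writing the estimator in the weighted form \({\hat{\mu }}_v=\sum _i w_i v(Y_i)/\sum _i w_i\) with \(w_i={\hat{f}}_h(Y_i)/{\hat{g}}_b(Y_i)\). The following minimal sketch computes \({\widehat{A}}\), \({\widehat{B}}\) and the resulting estimate, reusing the illustrative N(0, 1)/N(1, 1) setting from the example after the abstract, with \({\hat{f}}_h\) built from the small unbiased sample, \({\hat{g}}_b\) from the large biased one, and ad hoc bandwidths instead of the data-driven choices studied in the paper; it is an assumption-laden illustration, not the paper's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Illustrative setting: real population N(0, 1), biased sample drawn from N(1, 1).
N, n = 10_000, 200
Y = rng.normal(1.0, 1.0, size=N)         # big-but-biased sample
X = rng.normal(0.0, 1.0, size=n)         # small simple random sample

# Kernel density estimates: f_hat (bandwidth h) from the small unbiased sample,
# g_hat (bandwidth b) from the large biased sample.  The bandwidth factors below
# are ad hoc stand-ins for the smoothing parameter choices studied in the paper.
f_hat = gaussian_kde(X, bw_method=0.4)
g_hat = gaussian_kde(Y, bw_method=0.2)


def v(y):
    return y                              # v(y) = y corresponds to mean estimation


w = f_hat(Y) / g_hat(Y)                   # weights f_hat_h(Y_i) / g_hat_b(Y_i)
mu_hat = np.sum(w * v(Y)) / np.sum(w)     # weighted estimator implied by (16)

mu_v = 0.0                                # true mean, known here by construction
A_hat = np.mean(w * (v(Y) - mu_v))        # the term (17)
B_hat = np.mean(w)                        # the term (18)

print("mu_hat      =", mu_hat)
print("A_hat/B_hat =", A_hat / B_hat)     # equals mu_hat - mu_v, as in (16)
```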

Lemma 3

The expectation and variance of \({\widehat{A}}\) can be approximated by

$$\begin{aligned} E\left( {\widehat{A}}\right)\simeq & {} E\left( {\widehat{A}}_1\right) +E\left( {\widehat{A}}_2\right) - E\left( {\widehat{A}}_3\right) , \end{aligned}$$
(19)
$$\begin{aligned} Var\left( {\widehat{A}}\right)\simeq & {} Var\left( {\widehat{A}}_1\right) +Var\left( {\widehat{A}}_2\right) + Var\left( {\widehat{A}}_3\right) \nonumber \\+ & {} 2 Cov\left( {\widehat{A}}_1, {\widehat{A}}_2\right) -2Cov\left( {\widehat{A}}_1, {\widehat{A}}_3\right) - 2Cov\left( {\widehat{A}}_2, {\widehat{A}}_3\right) . \end{aligned}$$
(20)

The proof of Theorem 1 proceeds by analyzing the expectations and variances involved. It is available in the supplementary materials.

1.2 Sketch of the proof of Theorem 2

Using Lemma 2, in this case \({\widehat{A}}\) can be expressed as: \({\widehat{A}} ={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^* - {\widehat{A}}_4^* + {\widehat{A}}_5^*,\) where

$$\begin{aligned} {\widehat{A}}_1^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{(K_h*f)(Y_i)}{(K_b*g)(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_2^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)-(K_h*f)(Y_i)}{(K_b*g)(Y_i)}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_3^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{(K_h*f)(Y_i) ( {\hat{g}}_b(Y_i)-(K_b*g)(Y_i) )}{(K_b*g)(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_4^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{({\hat{f}}_h(Y_i)-(K_h*f)(Y_i)) ( {\hat{g}}_b(Y_i)-(K_b*g)(Y_i) )}{(K_b*g)(Y_i)^2}(v(Y_i)-\mu _v)}, \\ {\widehat{A}}_5^*:= & {} \dfrac{1}{N}\displaystyle \sum _{i=1}^{N}{\dfrac{{\hat{f}}_h(Y_i)}{{\hat{g}}_b(Y_i)}\left( \dfrac{{\hat{g}}_b(Y_i)-(K_b*g)(Y_i)}{(K_b*g)(Y_i)}\right) ^2(v(Y_i)-\mu _v)}, \end{aligned}$$

where \({\widehat{A}}_4^*\) and \({\widehat{A}}_5^*\) are negligible terms. Thus we consider \({\widehat{A}}^* :={\widehat{A}}_1^* +{\widehat{A}}_2^* - {\widehat{A}}_3^*. \)

The proof of Theorem 2 follows parallel lines to that of Theorem 1. It is available in the supplementary materials.
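For concreteness, the convolutions \((K_h*f)\) and \((K_b*g)\) appearing in the terms above are the kernel-smoothed versions of the densities, \((K_h*f)(y)=\int K_h(y-u)f(u)\,du\). A minimal numerical sketch, assuming a Gaussian kernel and taking \(f\) to be the N(0, 1) density (for which the convolution is the N(0, 1 + h^2) density in closed form), is given below; it only illustrates the notation and plays no role in the proof.

```python
import numpy as np
from scipy.stats import norm

# (K_h * f)(y) = integral of K_h(y - u) f(u) du, the kernel-smoothed density
# replacing f in the A*_j terms.  With a Gaussian kernel and f the N(0, 1)
# density, the convolution is the N(0, 1 + h^2) density in closed form.
h = 0.3
y = np.linspace(-3.0, 3.0, 7)

closed_form = norm.pdf(y, loc=0.0, scale=np.sqrt(1.0 + h**2))

# Numerical convolution on a fine grid as a check.
u = np.linspace(-10.0, 10.0, 20_001)
du = u[1] - u[0]
f_u = norm.pdf(u)                                     # f = N(0, 1) density
K_hu = norm.pdf((y[:, None] - u[None, :]) / h) / h    # scaled kernel K_h(y - u)
numerical = (K_hu * f_u).sum(axis=1) * du

print(np.max(np.abs(closed_form - numerical)))        # tiny discretization error
```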


About this article


Cite this article

Borrajo, L., Cao, R. Nonparametric estimation for big-but-biased data. TEST 30, 861–883 (2021). https://doi.org/10.1007/s11749-020-00749-5
