
Kernel based estimation of the distribution function for length biased data


Abstract

Empirical and kernel estimators are considered for the distribution of positive length biased data. Their asymptotic bias, variance and limiting distribution are obtained. For the kernel estimator, the asymptotically optimal bandwidth is calculated and rule-of-thumb bandwidths are proposed. At any point below the median, the asymptotic mean squared error of the kernel estimator is smaller than that of the empirical estimator. A suitably truncated kernel estimator is positive, and we prove the strong uniform and \(L_2\) consistency of this estimator. Simulations reveal the improved performance of the truncated kernel estimator in estimating tail probabilities based on length biased data.
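The estimators studied here can be sketched in a few lines of code. Below is a minimal Python illustration, assuming the weighted (Cox/Vardi-type) form of the empirical estimator and the ratio form \(N/D\) of the kernel estimator that the appendix works with; the integrated Epanechnikov kernel and all function names are illustrative choices, not from the paper.

```python
import numpy as np

def K_epanechnikov(u):
    """Integrated Epanechnikov kernel K(u) = int_{-1}^{min(u,1)} 0.75*(1 - v^2) dv,
    so that K(-1) = 0 and K(1) = 1, as required of the kernel df in the appendix."""
    u = np.clip(u, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * u - u**3)

def F_empirical(y, Y):
    """Weighted empirical df from length-biased data Y_1,...,Y_n:
    F_n(y) = (sum_i 1{Y_i <= y} / Y_i) / (sum_i 1 / Y_i)."""
    w = 1.0 / Y
    return np.sum(w * (Y <= y)) / np.sum(w)

def F_kernel(y, Y, h):
    """Kernel estimator F_h(y) = (sum_i K((y - Y_i)/h) / Y_i) / (sum_i 1 / Y_i),
    i.e. the ratio N/D analysed in the appendix."""
    w = 1.0 / Y
    return np.sum(w * K_epanechnikov((y - Y) / h)) / np.sum(w)

def F_truncated(y, Y, h):
    """Truncated kernel estimator: the mass that F_h places below 0 is removed
    and the remainder renormalised, consistent with the difference formula in
    the proof of Lemma 7."""
    if y <= 0:
        return 0.0
    F0 = F_kernel(0.0, Y, h)
    return (F_kernel(y, Y, h) - F0) / (1.0 - F0)
```

For instance, with length-biased log-normal data (see the note below) one can compare `F_empirical` and `F_truncated` against the true df in the tails, mirroring the simulations summarised above.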


Notes

  1. A non-negative r.v. is said to follow the log-normal\((\mu ,\ \sigma ^2)\) distribution iff the natural logarithm of the random variable follows the normal distribution with mean \(\mu \) and variance \(\sigma ^2\).
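This family is convenient for length-biased sampling because it is closed under length biasing; a standard fact (not stated in the paper) is that if \(f\) is the log-normal\((\mu ,\ \sigma ^2)\) density, with mean \(e^{\mu +\sigma ^2/2}\), then

$$\begin{aligned} \frac{y\,f(y)}{e^{\mu +\sigma ^2/2}} \end{aligned}$$

is the log-normal\((\mu +\sigma ^2,\ \sigma ^2)\) density, so exact length-biased samples are easy to generate.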


Acknowledgements

We are deeply thankful to the referee for pointing out several mistakes and for offering constructive suggestions for improvement.


Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information


A. Bose: Research supported by J.C. Bose National Fellowship, Govt. of India. S. Dutta: Research supported by the MATRICS scheme No. MTR/2019/000502 of the Science and Engineering Research Board (SERB), Govt. of India.

Appendix

Proof

We can write

$$\begin{aligned} \frac{N}{D}=\frac{N}{E(D)(1-r)},\quad \text {where}\ 1-r=\frac{D}{E(D)}. \end{aligned}$$

Under the stated assumptions \(P(D=0)=0\) (note that \(r=1\) if and only if \(D=0\)), and therefore \(P(r\ne 1)=1\). We know that for \(r\ne 1\),

$$\begin{aligned} \frac{1}{1-r}=1+r+\frac{r^2}{(1-r)}.\end{aligned}$$
(6.1)

It is easy to verify that

$$\begin{aligned} E\left[ \left( 1+r+\frac{r^2}{1-r}\right) \frac{N}{E(D)}\right] =\frac{E(N)}{E(D)}-\frac{1}{E^2(D)}Cov(N,D)+\frac{1}{E^2(D)}E\left[ \frac{N}{D}(D-E(D))^2\right] .\end{aligned}$$
(6.2)

Multiplying both sides of (6.1) by \(\frac{N}{E(D)}\), taking expectations, and using (6.2), we obtain the stated inequality. \(\square \)
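Since \(N\ge 0\) and \(D>0\), the last term in (6.2) is nonnegative, so the inequality obtained is presumably (a reconstruction, as the statement being proved is not reproduced in this excerpt)

$$\begin{aligned} E\left( \frac{N}{D}\right) \ge \frac{E(N)}{E(D)}-\frac{1}{E^2(D)}Cov(N,D). \end{aligned}$$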

Proof of Lemma 2

To prove Lemma 2 it is enough to see that

$$\begin{aligned} E\left[ \frac{1}{Y_1}I(Y_1\le y)\right] =\frac{1}{\mu }\int ^y_0 f(z)\,dz=\frac{F(y)}{\mu } \end{aligned}$$
(6.3)
$$\begin{aligned} \text{and}\quad E\left[ \frac{1}{Y^2_1}I(Y_1\le y)\right] =\frac{1}{\mu }\int ^y_0 \frac{f(z)}{z}\,dz.\end{aligned}$$
(6.4)

Using (6.3) and (6.4), it is straightforward to verify the expressions for \(Var\left[ \frac{1}{Y_1}I(Y_1\le y)\right] \) and \(Cov\left[ \frac{1}{Y_1}I(Y_1\le y),\frac{1}{Y_1}\right] \) in Lemma 2. \(\square \)
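Spelled out, (6.3), (6.4) and \(E\left( \frac{1}{Y_1}\right) =\frac{1}{\mu }\) give (a routine reconstruction of the expressions referred to above)

$$\begin{aligned} Var\left[ \frac{1}{Y_1}I(Y_1\le y)\right] =\frac{1}{\mu }\int ^y_0\frac{f(z)}{z}\,dz-\frac{F^2(y)}{\mu ^2},\qquad Cov\left[ \frac{1}{Y_1}I(Y_1\le y),\frac{1}{Y_1}\right] =\frac{1}{\mu }\int ^y_0\frac{f(z)}{z}\,dz-\frac{F(y)}{\mu ^2}. \end{aligned}$$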

Proof of Lemma 3

$$\begin{aligned} E\left[ \frac{1}{Y^2_1}K^2\left( \frac{y-Y_1}{h}\right) \right]&= \frac{1}{\mu }\int ^\infty _0 K^2\left( \frac{y-z}{h}\right) \frac{f(z)}{z}\,dz=\frac{h}{\mu }\int ^{y/h}_{-\infty }K^2(v)\frac{f(y-vh)}{y-vh}\,dv\\&=\frac{h}{y\mu }\int ^{y/h}_{-\infty }K^2(v)f(y-vh)\,dv+\frac{h^2}{\mu }\int ^{y/h}_{-\infty }vK^2(v)\frac{f(y-vh)}{y(y-vh)}\,dv\\&=\frac{h}{y\mu }\int ^{y/h}_{-\infty }K^2(v)f(y-vh)\,dv+O(h^2)\quad (\text{by Assumptions (A1)--(A3)})\\&=-\frac{1}{y\mu }\int ^{y/h}_{-\infty }K^2(v)\,dF(y-vh)+O(h^2).\end{aligned}$$
(6.5)

Using integration by parts we get

$$\begin{aligned}&\int ^{y/h}_{-\infty }K^2(v)\,dF(y-vh)+\int ^{y/h}_{-\infty }F(y-hv)\,dK^2(v)=F(0)K^2(y/h)-F(\infty )K^2(-\infty )\\&\quad \Rightarrow \ -\int ^{y/h}_{-\infty }K^2(v)\,dF(y-vh)=\int ^{y/h}_{-\infty }F(y-hv)\,dK^2(v)\quad (\text{as}\ K(-\infty )=0\ \text{and}\ F(0)=0). \end{aligned}$$

Therefore, as \(n\rightarrow \infty \)

$$\begin{aligned} -\int ^{y/h}_{-\infty }K^2(v)\,dF(y-vh)&=F(y)K^2\left( \frac{y}{h}\right) -hf(y)\int ^{y/h}_{-\infty }v\,dK^2(v)+O(h^2)\\&=F(y)-hf(y)\int ^1_{-1}v\,dK^2(v)+O(h^2)\quad (\text{as}\ K(1)=1).\end{aligned}$$
(6.6)

From (6.5) and (6.6) we get

$$\begin{aligned} E\left[ \frac{1}{Y^2_1}K^2\left( \frac{y-Y_1}{h}\right) \right] =\frac{1}{y\mu }\left[ F(y)-hf(y)\int ^1_{-1}v\,dK^2(v)\right] +O(h^2)\quad \text{as}\ n\rightarrow \infty .\end{aligned}$$
(6.7)

Further under Assumptions (A1) to (A3) we get

$$\begin{aligned} E\left[ \frac{1}{Y_1}K\left( \frac{y-Y_1}{h}\right) \right] =-\frac{1}{\mu }\int ^{y/h}_{-\infty }K(v)\,dF(y-vh)=\frac{F(y)}{\mu }+O(h^2)\quad \text{as}\ n\rightarrow \infty .\end{aligned}$$
(6.8)

Equations (6.7) and (6.8) imply that

$$\begin{aligned} Var(N)&=\frac{1}{n}Var\left[ \frac{1}{Y_1}K\left( \frac{y-Y_1}{h}\right) \right] \\&=\frac{1}{n\mu }\left[ \frac{F(y)}{y}-\frac{hf(y)}{y}\int ^1_{-1}v\,dK^2(v)-\frac{F^2(y)}{\mu }\right] +O\left( \frac{h^2}{n}\right) \quad \text{as}\ n\rightarrow \infty .\end{aligned}$$

\(\square \)

Using similar arguments we get

$$\begin{aligned} E\left[ \frac{1}{Y^2_1}K\left( \frac{y-Y_1}{h}\right) \right]&=\frac{1}{y\mu }\left[ F(y)-hf(y)\int ^1_{-1}v\,dK(v)\right] +O(h^2)\\&=\frac{F(y)}{y\mu }+O(h^2)\quad \text{as}\ n\rightarrow \infty .\end{aligned}$$
(6.9)

Equations (6.8) and (6.9) imply that

$$\begin{aligned} Cov(N,D)&=\frac{1}{n}Cov\left( \frac{1}{Y_1}K\left( \frac{y-Y_1}{h}\right) ,\frac{1}{Y_1}\right) \\&=\frac{F(y)}{n\mu }\left[ \frac{1}{y}-\frac{1}{\mu }\right] +O\left( \frac{h^2}{n}\right) \quad \text{as}\ n\rightarrow \infty . \end{aligned}$$
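Here the step from (6.8) and (6.9) is the covariance decomposition, spelled out for completeness:

$$\begin{aligned} Cov\left( \frac{1}{Y_1}K\left( \frac{y-Y_1}{h}\right) ,\frac{1}{Y_1}\right) =E\left[ \frac{1}{Y^2_1}K\left( \frac{y-Y_1}{h}\right) \right] -E\left[ \frac{1}{Y_1}K\left( \frac{y-Y_1}{h}\right) \right] E\left( \frac{1}{Y_1}\right) =\frac{F(y)}{y\mu }-\frac{F(y)}{\mu ^2}+O(h^2). \end{aligned}$$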

\(\square \)

Proof of Lemma 4

$$\begin{aligned} \hat{\sigma }^2_E&=\frac{1}{\left( \frac{1}{n}\sum ^n_{i=1}\frac{1}{Y_i}\right) ^2}\,\frac{1}{n}\sum ^n_{i=1}\frac{1}{Y^2_i}\left[ I(Y_i\le y)-\hat{F}_n(y)\right] ^2\\&=\frac{1}{\left( \frac{1}{n}\sum ^n_{i=1}\frac{1}{Y_i}\right) ^2}\left[ \frac{1}{n}\sum ^n_{i=1}\frac{I(Y_i\le y)}{Y^2_i}+[\hat{F}_n(y)]^2\,\frac{1}{n}\sum ^n_{i=1}\frac{1}{Y^2_i}-2\hat{F}_n(y)\,\frac{1}{n}\sum ^n_{i=1}\frac{I(Y_i\le y)}{Y^2_i}\right] .\end{aligned}$$

As \(n\rightarrow \infty \), under Assumption (A2),

$$\begin{aligned} \frac{1}{n}\sum ^n_{i=1}\frac{1}{Y_i}\rightarrow E\left( \frac{1}{Y_1}\right) =\frac{1}{\mu },\qquad \frac{1}{n}\sum ^n_{i=1}\frac{I(Y_i\le y)}{Y^2_i}\rightarrow \frac{1}{\mu }\int ^y_0 \frac{f(z)}{z}\,dz,\quad \text{almost surely,} \end{aligned}$$
(6.10)
$$\begin{aligned} \frac{1}{n}\sum ^n_{i=1}\frac{1}{Y^2_i}\rightarrow \frac{1}{\mu }\int ^\infty _0 \frac{f(z)}{z}\,dz\quad \text{and}\quad \hat{F}_n(y)\rightarrow F(y),\quad \text{almost surely.} \end{aligned}$$
(6.11)

Therefore under Assumption (A2), from (6.10) and (6.11) we see that as \(n\rightarrow \infty \),

$$\begin{aligned} \hat{\sigma }^2_E \rightarrow \ &\mu \left[ \int ^y_0 \frac{f(z)}{z}\,dz+[F(y)]^2\int ^\infty _0 \frac{f(z)}{z}\,dz-2F(y)\int ^y_0 \frac{f(z)}{z}\,dz\right] \\ =\ &\mu \int ^\infty _0 \frac{f(z)}{z}\left[ I(z\le y)-F(y)\right] ^2 dz={\sigma }^2_E,\quad \text{almost surely.}\end{aligned}$$

Hence, \(\frac{\hat{\sigma }^2_E}{{\sigma }^2_E}\rightarrow 1\), almost surely, as \(n\rightarrow \infty \).

Let \(\hat{f}_n(y)\) be a strongly consistent estimator of \(f(y)\). Using (6.10) and (6.11), we see that, under the stated conditions (A1)-(A3), as \(n\rightarrow \infty \),

$$\begin{aligned} \hat{\sigma }^2_K&=\hat{\mu }\left[ \frac{\hat{F}_n(y)}{y}+\hat{F}^2_n(y)\,\frac{\hat{\mu }}{n}\sum ^n_{i=1}\frac{1}{Y^2_i}-\frac{2\hat{F}^2_n(y)}{y}-\frac{h\hat{f}_n(y)}{y}\int ^1_{-1}v\,dK^2(v)\right] \\&\rightarrow \mu \left[ \frac{F(y)}{y}+F^2(y)\int ^\infty _0 \frac{f(z)}{z}\,dz-\frac{2F^2(y)}{y}-\frac{hf(y)}{y}\int ^1_{-1}v\,dK^2(v)\right] ,\quad \text{almost surely.}\end{aligned}$$

From Theorem 3 we recall that, as \(n\rightarrow \infty \),

$$\begin{aligned} \sigma ^2_K=\mu \left[ \frac{F(y)}{y}+F^2(y)\int ^\infty _0 \frac{f(z)}{z}dz-\frac{2F^2(y)}{y}-\frac{hf(y)}{y}\int ^1_{-1}vdK^2(v)\right] +O(h^2). \end{aligned}$$

Therefore under the stated assumptions \(\frac{\hat{\sigma }^2_K}{\sigma ^2_K}\rightarrow 1,\) almost surely, as \(n\rightarrow \infty \). \(\square \)
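The plug-in estimator \(\hat{\sigma }^2_E\) in the first display of this proof translates directly into code. A minimal self-contained Python sketch (function and variable names are illustrative):

```python
import numpy as np

def sigma2_E(y, Y):
    """Plug-in estimator of the asymptotic variance sigma^2_E at y:
    [ (1/n) sum_i (1/Y_i^2) (1{Y_i <= y} - F_n(y))^2 ] / [ (1/n) sum_i 1/Y_i ]^2."""
    w = 1.0 / Y
    F_hat = np.sum(w * (Y <= y)) / np.sum(w)        # weighted empirical df F_n(y)
    resid2 = ((Y <= y).astype(float) - F_hat) ** 2  # squared centred indicators
    return np.mean(w**2 * resid2) / np.mean(w) ** 2
```

By Lemma 4, \(\hat{\sigma }^2_E/\sigma ^2_E\rightarrow 1\) almost surely, so this quantity can be used, for example, in asymptotic confidence intervals at a fixed \(y\).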

Proof of Lemma 5

The asymptotic mean squared error \(AMSE(h)\) of \(\hat{F}_h(y)\) is the sum of the asymptotic variance and the square of the asymptotic bias. The expressions for the asymptotic bias and variance are obtained from the leading terms in Theorem 1 (ii) and from \(\sigma ^2_K\) in Theorem 3. \(\square \)
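Schematically, writing the leading bias from Theorem 1 (ii) as \(B_1h^2/2\) and the constants in \(\sigma ^2_K\) from Theorem 3 as \(V_0,\ V_1\) (placeholder symbols, since neither statement is reproduced in this excerpt), the minimisation runs as follows:

$$\begin{aligned} AMSE(h)=\frac{V_0}{n}-\frac{h\,V_1}{n}+\frac{h^4B_1^2}{4},\qquad \frac{d}{dh}AMSE(h)=0\ \Rightarrow \ h_{opt}=\left( \frac{V_1}{B_1^2\,n}\right) ^{1/3}\propto n^{-1/3}. \end{aligned}$$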

Proof of Lemma 7

Since \(||\hat{F}_{C,\ \hat{h}}-F||\le ||\hat{F}_{C,\ \hat{h}}-\hat{F}_{\hat{h}}||+||\hat{F}_{\hat{h}}-F||\), and since \(||\hat{F}_{\hat{h}}-F||\rightarrow 0\) almost surely as \(n\rightarrow \infty \) under the stated conditions (see Lemma 6), it is enough to prove that \(||\hat{F}_{C,\ \hat{h}}-\hat{F}_{\hat{h}}||\rightarrow 0\) almost surely as \(n\rightarrow \infty \).

$$\begin{aligned} \hat{F}_{C,\ \hat{h}}(y)-\hat{F}_{\hat{h}}(y)={\left\{ \begin{array}{ll}-\hat{F}_{\hat{h}}(0)\,\dfrac{1-\hat{F}_{\hat{h}}(y)}{1-\hat{F}_{\hat{h}}(0)}, & y>0,\\ -\hat{F}_{\hat{h}}(y), & y\le 0.\end{array}\right. }\end{aligned}$$

Therefore

$$\begin{aligned} ||\hat{F}_{C,\ \hat{h}}-\hat{F}_{\hat{h}}||\le \hat{F}_{\hat{h}}(0). \end{aligned}$$

But, \(\hat{F}_{\hat{h}}(0)\rightarrow F(0)=0\) almost surely, as \(n\rightarrow \infty \) (see Lemma 6). Consequently, \(||\hat{F}_{C,\ \hat{h}}-\hat{F}_{\hat{h}}||\rightarrow 0\), almost surely, as \(n\rightarrow \infty \). This completes the proof of the first part.

Since \(\hat{F}_{C,\ \hat{h}}\) and F are both dfs, \(||\hat{F}_{C,\ \hat{h}}-F||\le 1\) almost surely. Therefore, using the almost sure convergence of \(||\hat{F}_{C,\ \hat{h}}-F||\) to zero and the dominated convergence theorem, we see that \(E||\hat{F}_{C,\ \hat{h}}-F||^2\rightarrow 0\) as \(n\rightarrow \infty \). This completes the proof. \(\square \)
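For reference, the difference displayed in this proof corresponds to the truncated estimator (a reconstruction; the definition itself appears in the main text, not in this excerpt)

$$\begin{aligned} \hat{F}_{C,\ \hat{h}}(y)={\left\{ \begin{array}{ll}\dfrac{\hat{F}_{\hat{h}}(y)-\hat{F}_{\hat{h}}(0)}{1-\hat{F}_{\hat{h}}(0)}, & y>0,\\ 0, & y\le 0,\end{array}\right. }\end{aligned}$$

which is a genuine df placing no mass on \((-\infty ,0]\).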


Cite this article

Bose, A., Dutta, S. Kernel based estimation of the distribution function for length biased data. Metrika 85, 269–287 (2022). https://doi.org/10.1007/s00184-021-00824-3
