A test for the geometric distribution based on linear regression of order statistics

doi:10.1016/j.matcom.2020.08.023

Mathematics and Computers in Simulation

Volume 186, August 2021, Pages 103-123

https://doi.org/10.1016/j.matcom.2020.08.023

Abstract

This paper proposes and studies a novel test for the geometric distribution, based on a characterization of that law in terms of the conditional expectation of the second order statistic given the value of the first order statistic. The asymptotic null distribution of the test statistic and its limit under general conditions are derived, proving that the test is consistent against fixed alternatives. It can also detect alternatives converging to the null at the rate $n^{-1/2}$, where $n$ denotes the sample size. A weighted bootstrap and a parametric bootstrap can be used to consistently estimate the null distribution. The finite sample performance of these two bootstrap approximations is assessed via simulation. The power of the new test is numerically compared with that of some existing tests, and the comparison shows that the proposal is competitive.

Introduction

The geometric distribution is a count model with applications in many research areas, such as lifetime analysis, where it is the discrete counterpart of the exponential law, and capture–recapture methods, where it emerges as a Poisson mixing distribution (see, e.g. [2], [26]), among others. Let $X$ be a random variable taking positive integer values, $X \in \mathbb{N}$, with probability law $P(X = j) = q^{j-1} p$, $\forall j \geq 1$, (1.1) where $q = 1 - p$, for some $p \in (0, 1)$. Then we say that $X$ has a geometric distribution with parameter $p$ and write $X \sim \mathrm{Geo}(p)$.
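As a small illustration of the probability law above, the following Python sketch evaluates the mass function $q^{j-1}p$ on the positive integers and draws one value by counting Bernoulli trials until the first success. The function names `geo_pmf` and `geo_sample` and the value $p = 0.3$ are illustrative assumptions, not part of the paper.

```python
import random


def geo_pmf(j, p):
    """P(X = j) = q**(j - 1) * p for j = 1, 2, ..., with q = 1 - p."""
    if j < 1:
        return 0.0
    return (1.0 - p) ** (j - 1) * p


def geo_sample(p, rng):
    """Draw one Geo(p) value on {1, 2, ...}: trials needed up to the first success."""
    j = 1
    while rng.random() >= p:  # a failure occurs with probability 1 - p
        j += 1
    return j


p = 0.3  # illustrative value
# Truncated sums; the neglected tail q**399 is far below machine precision.
total_mass = sum(geo_pmf(j, p) for j in range(1, 400))
mean = sum(j * geo_pmf(j, p) for j in range(1, 400))
draw = geo_sample(p, random.Random(0))
```

Under these assumptions `total_mass` is numerically 1 and `mean` agrees with $E(X) = 1/p$.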

Testing the goodness-of-fit (gof) of given observations with a probabilistic model is a crucial aspect of data analysis. Let $X \in \mathbb{N}$. In this paper we consider the problem of testing $H_{0} : X \sim \mathrm{Geo}(p)$, for some $p \in (0, 1)$, against the general alternative $H_{1} : X \nsim \mathrm{Geo}(p)$, $\forall p \in (0, 1)$. Pearson’s $\chi^{2}$ test is commonly used for this testing problem. This test has the nice property that its test statistic is asymptotically distribution-free under the null hypothesis, provided that the parameters are estimated properly. Its practical application presents two main problems: first, cell selection is not a clear-cut task; second, the $\chi^{2}$ approximation to the null distribution is accurate only for rather large sample sizes (see, e.g. [1] for this point in another testing framework). In addition, this test is not consistent against all alternatives. The smooth test numerically studied in [3] shares the same shortcoming. The general tests proposed in [14], [17], [22], [30] can be applied to testing $H_{0}$, and all of them are consistent against fixed alternatives. A diagnostic tool, the ratio plot, has been investigated in [6] (see also [2], [5]) to graphically check $H_{0}$. [12] proposed a gof test that can be applied to any discrete law in the power series family; in particular, it can be applied to testing $H_{0}$. Nevertheless, its practical application presents some difficulties (specifically, the tabulation of all “arrangements” for each possible value of $t$, using the nomenclature in that paper).

This paper proposes and studies a novel gof test for the geometric distribution. The test is based on a characterization of that distribution introduced in [24], in terms of the conditional expectation of the second order statistic given the value of the first order statistic, which is linear if and only if the law is geometric. Then, using the well-known Bierens [4] characterization of conditional moments, a test statistic is proposed in Section 2. Section 3 derives the almost sure limit of the test statistic, its asymptotic null distribution and its distribution under contiguous alternatives. It is concluded that the test that rejects for “large” values of the test statistic is able to detect any fixed alternative, and that it detects alternatives converging to the null at the rate $1/\sqrt{n}$. Since the asymptotic null distribution of the test statistic depends on unknown parameters, it cannot be used directly to approximate the null distribution in practice. Section 4 studies two null distribution estimators, a weighted bootstrap and a parametric bootstrap, both of which are proven to be consistent. Section 5 summarizes the results of a simulation study designed to assess the finite sample performance of the test when the null distribution is estimated by the methods studied in Section 4, and to compare it with some competitors. The simulation results reveal that the new test has a very competitive performance, and hence it deserves to be included in any battery of gof tests for the geometric distribution. Finally, the new test is applied to a real data set. The proofs are deferred to Section 6. Section 7 displays the R code used to calculate the new test statistic. All limits in this paper are taken as $n \to \infty$, where $n$ denotes the sample size.

Section snippets

The test statistic

Let $X_{1}, \dots, X_{n}$ be independent and identically distributed (iid) discrete random variables taking positive integer values, with $P(X_{1} = j) = p_{j} > 0$, for $j \geq 1$. Let $X_{1:n} \leq \dots \leq X_{n:n}$ denote the order statistics. Theorem 3 of [24] shows that $E(X_{2:n} \mid X_{1:n} = j) = j + a$, $\forall j \geq 1$, for some $a > 0$, if and only if the probability mass function of $X_{1}$ satisfies (1.1), with $q \in (0, 1)$ being the solution of the equation $\frac{n}{a} = \frac{1 - q^{n}}{1 - q} \cdot \frac{1 - q^{n-1}}{q^{n-1}}$. For $n = 2$, denoting $M = \min\{X_{1}, X_{2}\} = X_{1:2}$ and $D = |X_{1} - X_{2}| = X_{2:2} - X_{1:2}$, the above characterization can be
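To make the case $n = 2$ concrete, the sketch below computes $E(D \mid M = j)$ exactly (up to a truncated sum) from a probability mass function, using $P(M = j, D = 0) = p_j^2$ and $P(M = j, D = d) = 2 p_j p_{j+d}$ for $d \geq 1$. For a geometric law the result should not depend on $j$ and should equal $a = 2q/(1 - q^2)$, which is what the displayed equation gives for $n = 2$. The shifted Poisson law used as a contrast, the function names and the parameter values are illustrative assumptions.

```python
import math


def geo_pmf(j, p):
    """Geometric mass function on {1, 2, ...}."""
    return (1.0 - p) ** (j - 1) * p


def shifted_poisson_pmf(j, lam):
    """Mass function of 1 + Poisson(lam), computed on the log scale."""
    return math.exp(-lam + (j - 1) * math.log(lam) - math.lgamma(j))


def cond_mean_d_given_m(pmf, j, max_d=500):
    """E(D | M = j) for X1, X2 iid with mass function pmf (sum truncated at max_d)."""
    pj = pmf(j)
    tail = [pmf(j + d) for d in range(1, max_d + 1)]
    numerator = 2.0 * pj * sum(d * pd for d, pd in enumerate(tail, start=1))
    denominator = pj * pj + 2.0 * pj * sum(tail)
    return numerator / denominator


p = 0.3  # illustrative value
q = 1.0 - p
a = 2.0 * q / (1.0 - q * q)  # solution of the displayed equation for n = 2

geo_values = [cond_mean_d_given_m(lambda j: geo_pmf(j, p), j) for j in range(1, 6)]
poi_values = [cond_mean_d_given_m(lambda j: shifted_poisson_pmf(j, 2.0), j)
              for j in range(1, 6)]
```

In this sketch every entry of `geo_values` coincides with `a`, whereas the entries of `poi_values` change with $j$, in line with the "if and only if" in the characterization.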

Asymptotic properties

We first calculate the limit of the proposed test statistic under general distributional assumptions.

Theorem 4

Let $X_{1}, \dots, X_{n}$ be iid from $X$, a random variable taking values in $\mathbb{N}$ such that $1 < E(X) < \infty$. Then $T_{n} \overset{a.s.}{\longrightarrow} \tau_{X} = \int |S_{X}(t)|^{2} w(t) \, dt$, where $S_{X}(t) = E\{(D - a_{X}) e^{i t M}\}$ and $a_{X} = 2 \frac{E(X) \{E(X) - 1\}}{2 E(X) - 1}$.

Notice that $\tau_{X} \geq 0$. Under the null hypothesis we have that $\tau_{X} = 0$. Moreover, since the weight function is positive, we have that $\tau_{X} = 0$ if and only if $H_{0}$ is true. Therefore, as intuitively stated in Section 2, a reasonable test should

Approximating the null distribution

This section studies two estimators of the null distribution of $T_{n}$ : a weighted bootstrap estimator and a parametric bootstrap estimator.
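A generic parametric bootstrap for this setting can be sketched as follows; it is not the authors' implementation. The sketch assumes that $T_n = \int S_n(t)^2 w(t)\,dt$ with $w$ the $N(0, \beta)$ density, that $\hat{a}$ is obtained by plugging $\bar{X}$ into the expression for $a_X$, that $p$ is estimated by $\hat{p} = 1/\bar{X}$, and that the p-value is the proportion of bootstrap statistics at least as large as the observed one. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def t_n(x, beta):
    """Assumed statistic: closed form of the integral of S_n(t)^2 against N(0, beta)."""
    n = len(x)
    mean = sum(x) / n
    a_hat = 2.0 * mean * (mean - 1.0) / (2.0 * mean - 1.0)  # plug-in a_X (assumption)
    sums = defaultdict(float)  # sums[u]: sum of (D - a_hat) over ordered pairs with M = u
    for j in range(n):
        for k in range(j + 1, n):
            sums[min(x[j], x[k])] += 2.0 * (abs(x[j] - x[k]) - a_hat)
    total = sum(au * av * math.exp(-0.5 * beta * (u - v) ** 2)
                for u, au in sums.items() for v, av in sums.items())
    return total / (n * (n - 1)) ** 2


def rgeom(p, n, rng):
    """n draws from Geo(p) on {1, 2, ...} by inversion; needs 0 < p < 1."""
    log_q = math.log(1.0 - p)
    return [1 + int(math.log(1.0 - rng.random()) / log_q) for _ in range(n)]


def parametric_bootstrap_pvalue(x, beta, n_boot, rng):
    """Approximate the null distribution of T_n by resampling from Geo(p_hat)."""
    mean = sum(x) / len(x)
    if mean <= 1.0:
        raise ValueError("sample mean must exceed 1 to estimate p in (0, 1)")
    p_hat = 1.0 / mean  # maximum likelihood estimate of p
    t_obs = t_n(x, beta)
    t_boot = [t_n(rgeom(p_hat, len(x), rng), beta) for _ in range(n_boot)]
    p_value = sum(t >= t_obs for t in t_boot) / n_boot
    return t_obs, p_value, t_boot
```

A call such as `parametric_bootstrap_pvalue(x, beta=1.0, n_boot=500, rng=random.Random(1))` returns the observed statistic, the bootstrap p-value and the bootstrap replicates; the test would reject $H_0$ when the p-value falls below the chosen level.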

Numerical results

The results stated so far are asymptotic, that is, they are valid for large sample sizes. With the aim of studying the finite sample performance of the proposed test, we carried out some simulation studies. Section 5.1 summarizes the outcomes of two experiments designed to compare the approaches in Section 4 to approximate the null distribution of $T_{n}$, that is, with respect to the level of the test. Section 5.2 reports the results of comparing the proposal in this paper with other existing tests of $H_{0}$ in terms of

Proofs

Throughout this section, $C$ is a generic positive constant that may take different values at different points of the proofs.

Proof of Theorem 4

We have that $S_{n}(t) = S_{1n}(t) + S_{2n}(t)$, with $S_{1n}(t) = \frac{1}{n(n-1)} \sum_{1 \leq j \neq k \leq n} (D_{jk} - a) \{\cos(t M_{jk}) + \sin(t M_{jk})\}$, $S_{2n}(t) = \frac{a - \hat{a}}{n(n-1)} \sum_{1 \leq j \neq k \leq n} \{\cos(t M_{jk}) + \sin(t M_{jk})\}$, and $a$ is as defined in (2.1). From the strong law of large numbers (SLLN) for $U$-statistics (see, e.g. [31]), $S_{1n}(t) \overset{a.s.}{\longrightarrow} S_{X}(t)$, $\forall t \in \mathbb{R}$. Since $|\cos(t M_{jk})| \leq 1$ and $|\sin(t M_{jk})| \leq 1$, $\forall t \in \mathbb{R}$, and $0 \leq D_{jk} \leq X_{k} + X_{j}$, $1 \leq j \neq k \leq n$, it follows that $|S_{1n}(t)| \leq 4 \bar{X} + 2a$, $\forall t \in \mathbb{R}$. From the SLLN, $\bar{X} \overset{a.s.}{\longrightarrow} E(X) < \infty$.

R code for the exact calculation of $T_{n}$

This section displays the function we wrote for the exact calculation of the test statistic $T_{n}$, with the weight function $w$ being the probability density function of a normal law with mean 0 and variance $\beta$. The inputs are x, the vector containing the data, and beta, the variance of the weight function.
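The R function itself is not reproduced here. As a stand-in, the following Python sketch shows one way such an exact calculation could be carried out, under assumptions that are not confirmed by the text above: $T_n = \int S_n(t)^2 w(t)\,dt$, with $S_n(t) = \frac{1}{n(n-1)} \sum_{j \neq k} (D_{jk} - \hat{a}) \{\cos(t M_{jk}) + \sin(t M_{jk})\}$, and $\hat{a}$ obtained by replacing $E(X)$ with $\bar{X}$ in $a_X$. Since $\{\cos(tu) + \sin(tu)\}\{\cos(tv) + \sin(tv)\} = \cos\{t(u - v)\} + \sin\{t(u + v)\}$, and the $N(0, \beta)$ law has characteristic function $e^{-\beta s^2/2}$, the integral reduces to a double sum with kernel $e^{-\beta (u - v)^2/2}$. The names `a_hat` and `t_n` are hypothetical.

```python
import math
from collections import defaultdict


def a_hat(x):
    """Plug-in estimate of a_X = 2 E(X){E(X) - 1} / {2 E(X) - 1} (assumed form)."""
    mean = sum(x) / len(x)
    return 2.0 * mean * (mean - 1.0) / (2.0 * mean - 1.0)


def t_n(x, beta):
    """Closed-form value of the assumed T_n for data x and weight variance beta."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are needed")
    a = a_hat(x)
    # sums[u] accumulates (D_jk - a_hat) over ordered pairs (j, k), j != k, with M_jk = u.
    sums = defaultdict(float)
    for j in range(n):
        for k in range(j + 1, n):
            m = min(x[j], x[k])
            d = abs(x[j] - x[k])
            sums[m] += 2.0 * (d - a)  # (j, k) and (k, j) contribute equally
    total = 0.0
    for u, au in sums.items():
        for v, av in sums.items():
            # Integral of cos(t(u - v)) against N(0, beta); the sine term integrates to 0.
            total += au * av * math.exp(-0.5 * beta * (u - v) ** 2)
    return total / (n * (n - 1)) ** 2
```

Grouping the pairs by the value of $M_{jk}$ keeps the double sum over distinct minima rather than over all pairs of pairs, which matters for larger samples. For example, `t_n([1, 3, 2, 5, 1, 2, 4], beta=0.5)` evaluates the statistic for a small data vector.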

Acknowledgments

The authors thank two anonymous referees for their constructive comments and suggestions, which helped to improve the presentation. The research in this paper has been partially funded by grants CTM2015–68276–R of the Spanish Ministry of Economy and Competitiveness (M.V. Alba-Fernández) and MTM2017-89422-P of the Spanish Ministry of Economy, Industry and Competitiveness, ERDF support included (M.D. Jiménez-Gamero).

References (35)

Bierens H.J., Consistent model specification tests, J. Econometrics (1982)

Burke M.D., Multivariate tests-of-fit and uniform confidence bands using a weighted bootstrap, Statist. Probab. Lett. (2000)

Gürtler N. et al., Recent and classical goodness-of-fit tests for the Poisson distribution, J. Statist. Plann. Inference (2000)

Jiménez-Gamero M.D. et al., Goodness-of-fit tests based on empirical characteristic functions, Comput. Statist. Data Anal. (2009)

Jiménez-Gamero M.D. et al., Bootstrapping parameter estimated degenerate U and V statistics, Statist. Probab. Lett. (2003)

Jiménez-Gamero M.D. et al., Fast goodness-of-fit tests based on the characteristic function, Comput. Stat. Data Anal. (2015)

Alba Fernández M.V. et al., Bootstrapping divergence statistics for testing homogeneity in multinomial populations, Math. Comput. Simul. (2009)

Anan O. et al., On the Turing estimator in capture-recapture count data under the geometric distribution, Metrika (2019)

Best D.J. et al., Tests of fit for the geometric distribution, Commun. Stat. Simul. C (2003)

Böhning D. et al., Use of the ratio plot in capture–recapture estimation, J. Comput. Graph. Stat. (2013)

Böhning D. et al., The geometric distribution, the ratio plot under the null and the burden of dengue fever in Chiang Mai province

Chen X. et al., Central limit and functional central limit theorems for Hilbert-valued dependent heterogeneous arrays with applications, Econom. Theory (1998)

Dehling H. et al., Random quadratic forms and the bootstrap for $U$-statistics, J. Multivariate Anal. (1994)

Escanciano J.C., Goodness-of-fit tests for linear and nonlinear time series models, J. Amer. Statist. Assoc. (2006)

Giacomini R. et al., A warp-speed method for conducting Monte Carlo experiments involving bootstrap estimators, Econometric Theory (2013)

González-Barrios J.M. et al., Goodness of fit for discrete random variables using the conditional density, Metrika (2006)

Henze N., Empirical-distribution-function goodness-of-fit tests for discrete models, Can. J. Stat. (1996)
