Adaptive nonparametric estimation of a component density in a two-class mixture model

doi:10.1016/j.jspi.2021.05.004

Journal of Statistical Planning and Inference

Volume 216, January 2022, Pages 51-69

https://doi.org/10.1016/j.jspi.2021.05.004

Abstract

A two-class mixture model, where the density of one of the components is known, is considered. We address the issue of the nonparametric adaptive estimation of the unknown probability density of the second component. We propose a randomly weighted kernel estimator with a fully data-driven bandwidth selection method, in the spirit of the Goldenshluger and Lepski method. An oracle-type inequality for the pointwise quadratic risk is derived as well as convergence rates over Hölder smoothness classes. The theoretical results are illustrated by numerical simulations.

Introduction

This article considers the following two-component mixture model: $g (x) = θ + (1 - θ) f (x), \forall x \in [0, 1], \quad (1)$ where the mixing proportion $θ \in (0, 1)$ and the probability density function $f$ on $[0, 1]$ are unknown. It is assumed that $n$ independent and identically distributed (i.i.d. in the sequel) random variables $X_{1}, \dots, X_{n}$ drawn from the density $g$ are observed. The main goal is to construct an adaptive estimator of the nonparametric component $f$ and to provide non-asymptotic upper bounds on the pointwise risk: the resulting estimator should automatically adapt to the unknown smoothness of the target function. The challenge arises from the fact that there is no direct observation coming from $f$ . As an intermediate step, the estimation of the parametric component $θ$ is addressed as well.

Model (1) appears in several statistical settings, robust estimation and multiple testing among others. The setting chosen in the present article, as described above, comes from the multiple testing framework, where a large number $n$ of independent hypothesis tests are performed simultaneously. The $p$ -values $X_{1}, \dots, X_{n}$ generated by these tests can be modelled by (1). Indeed, they are uniformly distributed on $[0, 1]$ under the null hypotheses, while their distribution under the alternative hypotheses, corresponding to $f$ , is unknown. The unknown parameter $θ$ is the asymptotic proportion of true null hypotheses. Estimating $f$ can be needed, especially to evaluate and control different types of expected errors of the testing procedure, which is a major issue in this context. See for instance Genovese and Wasserman (2002), Storey (2002), Langaas et al. (2005), Robin et al. (2007), Strimmer (2008), Nguyen and Matias (2014a), and more fundamentally, Benjamini and Hochberg (1995) and Efron et al. (2001).

In the setting of robust estimation, different from the multiple testing one, model (1) can be thought of as a contamination model, where the unknown distribution of interest $f$ is contaminated by the uniform distribution on $[0, 1]$ , with the proportion $θ$ . This is a very specific case of the Huber contamination model (Huber, 1965). The statistical task considered consists in robustly estimating $f$ from contaminated observations $X_{1}, \dots, X_{n}$ . But unlike our setting, the contamination distribution is not necessarily known while the contamination proportion $θ$ is assumed to be known, and the theoretical investigations aim at providing minimax rates as functions of both $n$ and $θ$ . See for instance the preprint of Liu and Gao (2019), which addresses pointwise estimation in this framework.

Returning to the setting of multiple testing, the estimation of $f$ in model (1) has been addressed in several works. Langaas et al. (2005) proposed a Grenander density estimator for $f$ , based on a nonparametric maximum likelihood approach, under the assumption that $f$ belongs to the set of decreasing densities on $[0, 1]$ . Following a similar approach, Strimmer (2008) also proposed a modified Grenander strategy to estimate $f$ . However, the two aforementioned papers do not investigate the theoretical properties of the proposed estimators. Robin et al. (2007) and Nguyen and Matias (2014a) proposed a randomly weighted kernel estimator of $f$ , where the weights are estimators of the posterior probabilities of the mixture model, that is, the probability of each individual $i$ being in the nonparametric component given the observation $X_{i}$ . Robin et al. (2007) propose an EM-like algorithm, and prove the convergence of the iterative procedure to a unique solution, but they do not provide any asymptotic property of the estimator. Note that their model $g (x) = θ ϕ (x) + (1 - θ) f (x)$ , where $ϕ$ is a known density, is slightly more general, but our procedure is also suitable for this model under some assumptions on $ϕ$ . Besides, Nguyen and Matias (2014a) achieve a nonparametric rate of convergence $n^{- 2 β / (2 β + 1)}$ for their estimator, where $β$ is the smoothness of the unknown density $f$ . However, their estimation procedure is not adaptive since the choice of their optimal bandwidth still depends on $β$ .

In the present work, a complete inference strategy for both $f$ and $θ$ is proposed. For the nonparametric component $f$ , a new randomly weighted kernel estimator is provided with a data-driven bandwidth selection rule. Theoretical results on the whole estimation procedure, especially adaptivity of the selection rule to unknown smoothness of $f$ , are proved under a given identifiability class of the model, which is an original contribution in this framework. Major results derived in this paper are the oracle-type inequality in Theorem 1, and the rates of convergence over Hölder classes, which are adapted to the control of pointwise risk of kernel estimators, in Corollary 1.

Unlike the usual approach in mixture models, the weights of the proposed estimator are not estimates of the posterior probabilities. The proposed alternative principle is simple and consists in using weights based on a change of density, from the target distribution $f$ , which is not directly reachable, to the distribution of the observed variables $g$ . A function $w$ is thus derived such that $f (x) = w (θ, g (x)) g (x)$ , for all $θ, x \in [0, 1]$ . This type of link between one of the conditional distributions given the hidden variables, $f$ , and the distribution of the observed variables, $g$ , is quite remarkable in the framework of mixture models. It is a key idea of our approach, since it implies a crucial equation for controlling the bias term of the risk; see Section 2.1 for more details. This is necessary to investigate adaptivity using the Goldenshluger and Lepski (GL) approach (Goldenshluger and Lepski, 2011), which has been used in various other contexts; see for instance Comte et al. (2013), Comte and Lacour (2013), Doumic et al. (2012), Reynaud-Bouret et al. (2014), who apply the GL method in kernel density estimation, and Bertin et al. (2016), Chagny (2013), Chichignoud et al. (2017) or Comte and Rebafka (2016).
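For the reader's convenience, one explicit form of the weight function can be obtained by solving model (1) for $f$ . This is a direct algebraic rearrangement, valid because $g (x) \geq θ > 0$ ; the notation used in Section 2.1 of the paper may differ.

$$
f (x) = \frac{g (x) - θ}{1 - θ}
      = \underbrace{\frac{1}{1 - θ} \left( 1 - \frac{θ}{g (x)} \right)}_{w (θ, g (x))} \, g (x),
\qquad x \in [0, 1].
$$

With this form, and writing $K_{h}$ for a generic kernel with bandwidth $h$ (notation assumed here), one gets $E [ w (θ, g (X_{1})) K_{h} (x_{0} - X_{1}) ] = \int K_{h} (x_{0} - x) w (θ, g (x)) g (x) \, dx = \int K_{h} (x_{0} - x) f (x) \, dx$ , so an oracle-weighted kernel estimator has the same bias as a kernel estimator built from a direct sample of $f$ .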

Thus oracle weights are defined by $w (θ, g (X_{i}))$ , $i = 1, \dots, n$ , but $g$ and $θ$ are unknown. These oracle weights are estimated by plug-in, using preliminary estimators of $g$ and $θ$ , based on an additional sample $X_{n + 1}, \dots, X_{2 n}$ . Some assumptions on these estimators are needed to prove the results on the estimator of $f$ ; this paper also provides estimators of $g$ and $θ$ which satisfy these assumptions. Note that the procedures of Nguyen and Matias (2014a) and Robin et al. (2007) actually require preliminary estimates of $g$ and $θ$ as well, but they do not deal with the additional uncertainty caused by the multiple use of the same observations in the estimates of $θ$ , $g$ and $f$ .
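To make the weighting principle concrete, here is a minimal Python sketch of a randomly weighted kernel estimate at a single point. It is an illustration under stated assumptions, not the paper's estimator (3): the weight is taken as $w (θ, g) = (1 - θ / g) / (1 - θ)$ , which follows from solving model (1) for $f$ ; the Gaussian kernel, the bandwidth, the sample size and the test density $f (x) = 4 {(1 - x)}^{3}$ with $θ = 0.65$ are choices made here. The sketch uses the true $g$ and $θ$ (oracle weights) to isolate the idea; in the procedure described above they would be replaced by preliminary estimates computed from the additional sample.

```python
import math
import random

def weight(theta, g_value):
    """Oracle weight w(theta, g(x)) = (1 - theta / g(x)) / (1 - theta)."""
    return (1.0 - theta / g_value) / (1.0 - theta)

def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumption; any suitable kernel could be used)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def weighted_kde(x0, sample, weights, h):
    """Weighted kernel estimate of f at x0 with bandwidth h."""
    total = sum(w * gaussian_kernel((x0 - x) / h) for x, w in zip(sample, weights))
    return total / (len(sample) * h)

# Illustrative setting (assumption): theta = 0.65 and f(x) = 4 (1 - x)^3 on [0, 1].
theta = 0.65
f = lambda x: 4.0 * (1.0 - x) ** 3
g = lambda x: theta + (1.0 - theta) * f(x)

rng = random.Random(0)

def draw_from_g():
    # With probability theta draw a uniform; otherwise invert F(x) = 1 - (1 - x)^4.
    if rng.random() < theta:
        return rng.random()
    return 1.0 - (1.0 - rng.random()) ** 0.25

n = 20000
sample = [draw_from_g() for _ in range(n)]
# Oracle weights: computed from the true g and theta, unavailable in practice.
weights = [weight(theta, g(x)) for x in sample]

x0, h = 0.3, 0.05
estimate = weighted_kde(x0, sample, weights, h)
print(f"estimate at x0: {estimate:.3f}, true f(x0): {f(x0):.3f}")
```

Only observations drawn from $g$ are used, yet the weighted average targets $f (x_{0})$ , because the weights convert an expectation under $g$ into an integral against $f$ .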

Identifiability issues are reviewed in Section 1.1 of Nguyen and Matias (2014b). In the present work, $f$ is assumed to vanish in a neighbourhood of $1$ to ensure identifiability. Under this assumption, $θ$ can be recovered as the infimum of $g$ . Moreover, as shown above by the equation linking $f$ to $g$ and $θ$ , $f$ is actually uniquely determined by $g$ and $θ$ , even when the latter is not the infimum of $g$ . Note that the theoretical results on the estimator of the nonparametric component $f$ do not depend on the chosen identifiability class, and can be transposed to other cases. For that reason, the discussion on identifiability is postponed to Section 4.2, after the results on the estimator of $f$ .
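The identifiability condition can be illustrated numerically. If $f$ vanishes on $[1 - δ, 1]$ for some $δ > 0$ , then $g (x) = θ$ on that interval, so $P (X_{1} \geq 1 - δ) = θ δ$ and the rescaled proportion of observations falling in $[1 - δ, 1]$ recovers $θ$ . The Python sketch below assumes that $δ$ is known, which is a simplification made here; it is not the estimator ${\tilde{θ}}_{n}$ constructed in the paper, and the function name `estimate_theta` and the test density are hypothetical.

```python
import random

def estimate_theta(sample, delta):
    """Proportion of observations in [1 - delta, 1], rescaled by the interval length."""
    count = sum(1 for x in sample if x >= 1.0 - delta)
    return count / (len(sample) * delta)

# Illustrative setting (assumption): f(x) = s/(1-delta) * (1 - x/(1-delta))^(s-1)
# on [0, 1 - delta], which vanishes on [1 - delta, 1].
theta, delta, s = 0.45, 0.3, 1.4
rng = random.Random(1)

def draw_from_g():
    if rng.random() < theta:
        return rng.random()  # uniform component
    # Inverse of the CDF F(x) = 1 - (1 - x/(1-delta))^s on [0, 1 - delta].
    return (1.0 - delta) * (1.0 - (1.0 - rng.random()) ** (1.0 / s))

sample = [draw_from_g() for _ in range(20000)]
theta_hat = estimate_theta(sample, delta)
print(f"theta_hat = {theta_hat:.3f} (true theta = {theta})")
```

When $δ$ is unknown, as in the setting of the paper, a data-driven construction is needed.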

The paper is organized as follows. Our randomly weighted estimator of $f$ is constructed in Section 2.1. Assumptions on $f$ and on preliminary estimators of $g$ and $θ$ required for proving the theoretical results are in this section too. In Section 2, a bias–variance decomposition for the pointwise risk of the estimator of $f$ is given as well as the convergence rate of the kernel estimator with a fixed bandwidth. In Section 3, an oracle inequality is given, which justifies our adaptive estimation procedure. Construction of the preliminary estimators of $g$ and $θ$ is to be found in Section 4. Numerical results illustrate the theoretical results in Section 5. Proofs of theorems, propositions and technical lemmas are postponed to Section 6.

Section snippets

Collection of kernel estimators for the target density

In this section, a family of kernel estimators for the density function $f$ based on a sample ${(X_{i})}_{i = 1, \dots, n}$ of i.i.d. variables with distribution $g$ is defined. It is assumed that preliminary estimators of both the mixing proportion $θ$ and the mixture density $g$ are available, and respectively denoted by ${\tilde{θ}}_{n}$ and $\hat{g}$ . They are defined from an additional sample ${(X_{i})}_{i = n + 1, \dots, 2 n}$ of independent variables also drawn from $g$ but independent of the first sample ${(X_{i})}_{i = 1, \dots, n}$ . Definitions, results and results on

Adaptive pointwise estimation

Let $H_{n}$ be a finite family of possible bandwidths $h > 0$ , whose cardinality is bounded by the sample size $n$ . The best estimator in the collection ${({\hat{f}}_{h})}_{h \in H_{n}}$ defined in (3) at the point $x_{0}$ is the one that has the smallest risk, or similarly, the smallest bias–variance decomposition. But since $f$ is unknown, in practice it is impossible to minimize over $H_{n}$ the r.h.s. of inequality (7) in order to select the best estimate. Thus, we propose a data-driven selection, with a rule in the spirit of
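As a rough illustration of what a Goldenshluger and Lepski type rule looks like at a single point, the following Python sketch compares each estimator with auxiliary estimators smoothed twice, subtracts a variance proxy, and minimizes the resulting criterion. Everything specific in it is an assumption made here and not taken from the paper: the Gaussian kernel (chosen because the convolution of two Gaussian kernels with bandwidths $h$ and $h'$ is Gaussian with bandwidth $\sqrt{h^{2} + h'^{2}}$ ), the variance proxy $V (h) = κ \log (n) / (n h)$ with a hypothetical tuning constant `kappa`, the bandwidth grid, and the use of oracle weights on simulated data.

```python
import math
import random

def gauss(u, h):
    """Gaussian kernel with bandwidth h: K_h(u) = K(u / h) / h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def weighted_estimate(x0, sample, weights, h):
    return sum(w * gauss(x0 - x, h) for x, w in zip(sample, weights)) / len(sample)

def gl_select(x0, sample, weights, bandwidths, kappa=1.0):
    """Goldenshluger-Lepski-type bandwidth choice at the point x0 (sketch)."""
    n = len(sample)
    # Hypothetical variance proxy of order log(n) / (n h); kappa is a tuning constant.
    V = {h: kappa * math.log(n) / (n * h) for h in bandwidths}
    f_hat = {h: weighted_estimate(x0, sample, weights, h) for h in bandwidths}
    criterion = {}
    for h in bandwidths:
        A = 0.0  # starting at 0 implements the positive part
        for hp in bandwidths:
            # Auxiliary estimator K_h * f_hat_hp: Gaussian with bandwidth sqrt(h^2 + hp^2).
            f_aux = weighted_estimate(x0, sample, weights, math.hypot(h, hp))
            A = max(A, (f_aux - f_hat[hp]) ** 2 - V[hp])
        criterion[h] = A + V[h]
    h_star = min(bandwidths, key=lambda h: criterion[h])
    return h_star, f_hat[h_star]

# Illustrative data (assumption): theta = 0.65, f(x) = 4 (1 - x)^3, oracle weights.
theta = 0.65
f = lambda x: 4.0 * (1.0 - x) ** 3
rng = random.Random(0)
sample = [rng.random() if rng.random() < theta else 1.0 - (1.0 - rng.random()) ** 0.25
          for _ in range(2000)]
weights = [(1.0 - theta / (theta + (1.0 - theta) * f(x))) / (1.0 - theta) for x in sample]
bandwidths = [1.0 / k for k in range(2, 12)]

h_star, f_selected = gl_select(0.3, sample, weights, bandwidths)
print(f"selected bandwidth: {h_star:.3f}, estimate: {f_selected:.3f}, true value: {f(0.3):.3f}")
```

The term `A` plays the role of a bias proxy and `V` that of a variance bound, so the selected bandwidth mimics the bias–variance trade-off without knowing the smoothness of $f$ . The constants appearing in the paper's criterion are not reproduced here.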

Estimation of the mixture density $g$ and the mixing proportion $θ$

This section is devoted to the construction of the preliminary estimators $\hat{g}$ and ${\tilde{θ}}_{n}$ , required to build (3). To define them, we assume that we observe an additional sample ${(X_{i})}_{i = n + 1, \dots, 2 n}$ distributed with density function $g$ , but independent of the sample ${(X_{i})}_{i = 1, \dots, n}$ . We explain how estimators $\hat{g}$ and ${\tilde{θ}}_{n}$ can be defined to satisfy the assumptions described at the beginning of Section 2.2, and also how we compute them in practice. The reader should bear in mind that other constructions are

Simulated data

We briefly illustrate the performance of the estimation method on simulated data, according to the following framework. We simulate observations with density $g$ defined by model (1) for sample sizes $n \in \{500, 1000, 2000\}$ . Three different cases of $(θ, f)$ are considered:

• $f_{1} (x) = 4 {(1 - x)}^{3} \mathbf{1}_{[0, 1]} (x)$ , $θ_{1} = 0.65$ .
• $f_{2} (x) = \frac{s}{1 - δ} {(1 - \frac{x}{1 - δ})}^{s - 1} \mathbf{1}_{[0, 1 - δ]} (x)$ with $(δ, s) = (0.3, 1.4)$ , $θ_{2} = 0.45$ .
• $f_{3} (x) = λ e^{- λ x} {(1 - e^{- λ b})}^{- 1} \mathbf{1}_{[0, b]} (x)$ , the density of the truncated exponential distribution on $[0, b]$ with $(λ, b) = (10, 0.9)$ , $θ_{3} = 0.35$ .
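The three scenarios above can be simulated by inverse-transform sampling, since each density has an explicit cumulative distribution function. The Python sketch below is one possible way to generate such data; the function names, the seed and the use of the standard `random` module are choices made here, not a description of the authors' code.

```python
import math
import random

def sample_f1(rng):
    # f1(x) = 4 (1 - x)^3 on [0, 1]; CDF F(x) = 1 - (1 - x)^4.
    return 1.0 - (1.0 - rng.random()) ** 0.25

def sample_f2(rng, delta=0.3, s=1.4):
    # CDF F(x) = 1 - (1 - x / (1 - delta))^s on [0, 1 - delta].
    return (1.0 - delta) * (1.0 - (1.0 - rng.random()) ** (1.0 / s))

def sample_f3(rng, lam=10.0, b=0.9):
    # Truncated exponential: CDF F(x) = (1 - exp(-lam x)) / (1 - exp(-lam b)) on [0, b].
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-lam * b))) / lam

def sample_mixture(n, theta, sample_f, seed=0):
    """Draw n observations from g = theta * Uniform[0, 1] + (1 - theta) * f."""
    rng = random.Random(seed)
    return [rng.random() if rng.random() < theta else sample_f(rng) for _ in range(n)]

scenarios = {
    "case 1": (0.65, sample_f1),
    "case 2": (0.45, sample_f2),
    "case 3": (0.35, sample_f3),
}
data = {name: sample_mixture(1000, theta, sampler)
        for name, (theta, sampler) in scenarios.items()}
for name, xs in data.items():
    print(name, "sample mean:", round(sum(xs) / len(xs), 3))
```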

The density $f_{1}$ is borrowed

Proofs

In the sequel, the notations $\tilde{P}$ , $\tilde{E}$ and $\widetilde{\mathrm{Var}}$ respectively denote the probability, the expectation and the variance associated with $X_{1}, \dots, X_{n}$ , conditionally on the additional random sample $X_{n + 1}, \dots, X_{2 n}$ .

Acknowledgements

We are very grateful to Catherine Matias for interesting discussions on mixture models. The research of the authors is partly supported by the French Agence Nationale de la Recherche (ANR-18-CE40-0014 projet SMILES) and by the French Région Normandie (projet RIN AStERiCs 17B01101GR). Finally, we gratefully acknowledge the referees for carefully reading the manuscript and for numerous suggestions that improved the paper.

References (30)

Celisse, Alain, et al. A cross-validation based estimation of the proportion of true null hypotheses. J. Statist. Plann. Inference (2010)

Comte, Fabienne, et al. Nonparametric estimation for stochastic differential equations with random effects. Stochastic Process. Appl. (2013)

Comte, Fabienne, et al. Nonparametric weighted estimators for biased data. J. Statist. Plann. Inference (2016)

Robin, Stéphane, et al. A semi-parametric approach for mixture models: Application to local false discovery rate estimation. Comput. Stat. Data Anal. (2007)

Benjamini, Yoav, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (1995)

Bertin, Karine, et al. Adaptive pointwise estimation of conditional density function (2013)

Bertin, Karine, et al. Adaptive pointwise estimation of conditional density function. Ann. Inst. Henri Poincaré Probab. Stat. (2016)

Butucea, Cristina. Two adaptive rates of convergence in pointwise density estimation. Math. Methods Stat. (2000)

Chagny, Gaëlle. Penalization versus Goldenshluger–Lepski strategies in warped bases regression. ESAIM Probab. Stat. (2013)

Chichignoud, Michaël, et al. Adaptive wavelet multivariate regression with errors in variables. Electron. J. Stat. (2017)

Comte, Fabienne. Estimation non-paramétrique (2015)

Comte, Fabienne, et al. Adaptive estimation of the conditional intensity of marker-dependent counting processes. Ann. Inst. Henri Poincaré Probab. Stat. (2011)

Comte, Fabienne, et al. Anisotropic adaptive kernel deconvolution. Ann. Inst. Henri Poincaré Probab. Stat. (2013)

Doumic, Marie, et al. Nonparametric estimation of the division rate of a size-structured population. SIAM J. Numer. Anal. (2012)

Efron, Bradley, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. (2001)
