Projection theorems and estimating equations for power-law models

https://doi.org/10.1016/j.jmva.2021.104734

Abstract

We extend projection theorems concerning the Hellinger and Jones et al. divergences to the continuous case. These projection theorems reduce certain estimation problems on generalized exponential models to linear problems. We introduce the notion of regularity for generalized exponential models and show that the projection theorems in this case are similar to the ones in the discrete and canonical case. We also apply these ideas to solve certain estimation problems concerning Student and Cauchy distributions.

Introduction

A divergence is a non-negative extended real-valued function D defined on pairs of probability distributions (p,q) that satisfies D(p,q)=0 if and only if p=q. The minimum divergence (or minimum distance) method is popular in statistical inference because of its many desirable properties, including robustness and efficiency [6], [51]. Minimization of the information divergence (I-divergence), or relative entropy, is closely related to maximum likelihood estimation (MLE) [25, Lem. 3.1]. MLE is not a preferred method when the data are contaminated by outliers. However, the I-divergence can be extended by replacing the logarithmic function with a power function, which produces divergences that are robust to outliers [5], [16], [36]. In this paper, we consider three such families of divergences that are well known in the context of robust statistics. They are defined as follows.

Let p and q be probability distributions having a common support $S \subseteq \mathbb{R}^d$. Let $\alpha > 0$, $\alpha \neq 1$.

  • (a) The Hellinger divergence $D_\alpha$ (also known as the Cressie–Read power divergence [17] or power divergence [52] and, up to a monotone function, the same as the Rényi divergence [55]):
$$D_\alpha(p,q) \triangleq \frac{1}{\alpha-1}\left[\int p(x)^{\alpha} q(x)^{1-\alpha}\,\mathrm{d}x - 1\right]. \tag{1}$$

  • (b) The Basu et al. divergence $B_\alpha$ (also known as the power pseudo-distance [11], [12], density power divergence [5], [37], [49], or β-divergence [47]):
$$B_\alpha(p,q) \triangleq \frac{\alpha}{1-\alpha}\int p(x)\,q(x)^{\alpha-1}\,\mathrm{d}x - \frac{1}{1-\alpha}\int p(x)^{\alpha}\,\mathrm{d}x + \int q(x)^{\alpha}\,\mathrm{d}x. \tag{2}$$

  • (c) The Jones et al. divergence $\mathscr{I}_\alpha$ [30], [44], [57], [58] (also known as relative α-entropy [40], [41], Rényi pseudo-distance [11], [12], logarithmic density power divergence [45], projective power divergence [28], or γ-divergence [16], [30]):
$$\mathscr{I}_\alpha(p,q) \triangleq \frac{\alpha}{1-\alpha}\ln\int p(x)\,q(x)^{\alpha-1}\,\mathrm{d}x - \frac{1}{1-\alpha}\ln\int p(x)^{\alpha}\,\mathrm{d}x + \ln\int q(x)^{\alpha}\,\mathrm{d}x. \tag{3}$$

Throughout the paper we assume that all the integrals are well defined over S. The integrals are with respect to the Lebesgue measure on $\mathbb{R}^d$ in the continuous case and with respect to the counting measure in the discrete case. Many well-known divergences fall in the above classes. For example, the chi-square divergence, the Bhattacharyya distance [8] and the Hellinger distance [7] fall in the $D_\alpha$ class; the Cauchy–Schwarz divergence [54, Eq. (2.90)] falls in the $\mathscr{I}_\alpha$ class; the squared Euclidean distance falls in the $B_\alpha$ class [5]. All three classes of divergences coincide with the I-divergence as $\alpha \to 1$ [16], where
$$I(p,q) \triangleq \int p(x)\ln\frac{p(x)}{q(x)}\,\mathrm{d}x. \tag{4}$$
In this sense, each of these three classes can be regarded as a generalization of the I-divergence.
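
As a quick numerical check of this α→1 limit (an illustration added here, not taken from the paper; the two discrete distributions are arbitrary), the following Python snippet evaluates (1)–(3) for α approaching 1 and compares them with the I-divergence (4):

```python
import numpy as np

# Two discrete distributions on a common finite support.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def D_alpha(p, q, a):            # Hellinger divergence, Eq. (1)
    return (np.sum(p**a * q**(1 - a)) - 1) / (a - 1)

def B_alpha(p, q, a):            # Basu et al. divergence, Eq. (2)
    return (a / (1 - a)) * np.sum(p * q**(a - 1)) \
           - np.sum(p**a) / (1 - a) + np.sum(q**a)

def I_alpha(p, q, a):            # Jones et al. divergence, Eq. (3)
    return (a / (1 - a)) * np.log(np.sum(p * q**(a - 1))) \
           - np.log(np.sum(p**a)) / (1 - a) + np.log(np.sum(q**a))

def I_div(p, q):                 # I-divergence, Eq. (4)
    return np.sum(p * np.log(p / q))

for a in (0.5, 0.9, 0.99, 0.999):
    print(a, D_alpha(p, q, a), B_alpha(p, q, a), I_alpha(p, q, a))
print("I  ", I_div(p, q))        # all three values above approach this number
```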

$D_\alpha$-divergences also arise as generalized cut-off rates in information theory [21]. $B_\alpha$-divergences belong to the Bregman class, which is characterized by transitive projection rules [20, Eq. (3.2), Theorem 3], [37, Example 3]. $\mathscr{I}_\alpha$-divergences (for α<1) arise in information theory as a redundancy measure in the mismatched cases of guessing [57], source coding [41] and encoding of tasks [14]. The three classes of divergences are closely related to robust estimation, for α>1 in the case of $B_\alpha$ and $\mathscr{I}_\alpha$, and for α<1 in the case of $D_\alpha$, as we shall now see.

Let $X_1,\dots,X_n$ be an independent and identically distributed (i.i.d.) sample drawn from an unknown distribution p. Suppose that p is a member of a parametric family of probability distributions $\Pi = \{p_\theta : \theta \in \Theta\}$, where Θ is an open subset of $\mathbb{R}^k$ and all $p_\theta$ have a common support $S \subseteq \mathbb{R}^d$. MLE picks the distribution $p_\theta \in \Pi$ that would have most likely produced the sample. MLE solves the so-called score equation or estimating equation for θ, given by
$$\frac{1}{n}\sum_{j=1}^{n} s(X_j;\theta) = 0, \tag{5}$$
where $s(x;\theta) \triangleq \nabla \ln p_\theta(x)$ is called the score function and ∇ stands for the gradient with respect to θ. In the discrete case, the above equation can be re-written as
$$\sum_{x\in S} p_n(x)\, s(x;\theta) = 0, \tag{6}$$
where $p_n$ is the empirical measure of the sample $X_1,\dots,X_n$.
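
For a concrete (and deliberately simple) instance of the score equation (5), the sketch below assumes the N(θ,1) location model, whose score is $s(x;\theta)=x-\theta$, and solves (5) numerically; the root is of course the sample mean:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # i.i.d. sample

# Score function of N(theta, 1) with respect to the location theta:
# s(x; theta) = d/dtheta log p_theta(x) = x - theta.
def score_eq(theta):
    return np.mean(x - theta)                  # left-hand side of Eq. (5)

theta_hat = brentq(score_eq, -10, 10)          # root of the score equation
print(theta_hat, x.mean())                     # both equal the sample mean
```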

Let us now suppose that the sample $X_1,\dots,X_n$ is from a mixture distribution of the form $p_\epsilon = (1-\epsilon)p + \epsilon\delta$, $\epsilon \in [0,1)$, where p is supposed to be a member of $\Pi = \{p_\theta : \theta\in\Theta\}$; p is regarded as the distribution of the “true” samples and δ as that of the outliers. Assume that the support of δ is a subset of S. While the usual MLE tries to fit a distribution to $p_\epsilon$, robust estimation tries to fit one to $p_\theta$. Throughout the paper, this will be the setup in all the estimation problems, unless otherwise stated. Thus, for robust estimation, one needs to modify the estimating equation so that the effect of outliers is down-weighted. The following modified estimating equation, referred to as the generalized Hellinger estimating equation, was proposed in [4]; here the score function is weighted by $p_n(x)^{\alpha} p_\theta(x)^{1-\alpha}$ instead of $p_n(x)$ as in (6):
$$\sum_{x\in S} p_n(x)^{\alpha}\, p_\theta(x)^{1-\alpha}\, s(x;\theta) = 0, \tag{7}$$
where $\alpha\in(0,1)$. This was proposed based on the following intuition: if x is an outlier, then $p_n(x)^{\alpha} p_\theta(x)^{1-\alpha}$ will be smaller than $p_n(x)$ for sufficiently small values of α. Hence the terms corresponding to outliers in (7) are down-weighted (cf. [6, Section 4.3] and the references therein).
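
The down-weighting effect is easy to see numerically. The following sketch uses a hypothetical discrete model and a contaminated empirical measure (both invented here for illustration, not taken from the paper) and compares the weight $p_n(x)^{\alpha}p_\theta(x)^{1-\alpha}$ used in (7) with the weight $p_n(x)$ used in (6) at an outlier cell:

```python
import numpy as np

# Hypothetical example: support {0,...,9}, a model p_theta, and an
# empirical measure p_n contaminated at the cell x = 9.
support = np.arange(10)
p_theta = np.exp(-support); p_theta /= p_theta.sum()   # model probabilities
p_n = 0.9 * p_theta + 0.1 * (support == 9)             # 10% outlier mass at x = 9

alpha = 0.5
weights = p_n**alpha * p_theta**(1 - alpha)            # weights used in Eq. (7)

# At the outlier cell the weight is pulled down towards p_theta(9),
# whereas Eq. (6) would use the inflated p_n(9) directly.
print("p_n(9)    =", p_n[9])
print("weight(9) =", weights[9])
```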

Notice that (7) does not extend to the continuous case because of the appearance of $p_n^{\alpha}$. In the literature, this technical difficulty is avoided by using a smoothing technique, such as kernel density estimation [7, Section 3], [6, Sections 3.1, 3.2.1], the Basu–Lindsay approach [6, Section 3.5], or the modified approach of Cao et al. [15], to obtain a continuous estimate of $p_n$. The resulting estimating equation is of the form
$$\int \tilde{p}_n(x)^{\alpha}\, p_\theta(x)^{1-\alpha}\, s(x;\theta)\,\mathrm{d}x = 0, \tag{8}$$
where $\tilde{p}_n$ is some continuous estimate of $p_n$. To avoid this smoothing, Broniatowski et al. developed a duality technique in which one first finds a dual representation of the Hellinger distance and then minimizes the empirical estimate of this dual representation to find the estimator. The empirical estimate of this dual representation does not require any smoothing. See [9], [10], [11], [12], [48], [60] for details.
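
As a rough sketch of how (8) can be used in practice (my own illustration, assuming a Gaussian kernel density estimate and the N(θ,1) location model; the paper itself works with Student and Cauchy models later), one may solve (8) numerically on a contaminated sample:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm
from scipy.optimize import brentq
from scipy.integrate import quad

rng = np.random.default_rng(1)
# Contaminated sample: 90% from N(0,1) ("true"), 10% outliers from N(8,1).
x = np.concatenate([rng.normal(0, 1, 180), rng.normal(8, 1, 20)])

p_tilde = gaussian_kde(x)                      # continuous estimate of p_n
alpha = 0.5

def lhs(theta):
    # Left-hand side of Eq. (8) for the N(theta, 1) location model,
    # whose score is s(x; theta) = x - theta.
    integrand = lambda t: (p_tilde(t)[0]**alpha
                           * norm.pdf(t, theta, 1)**(1 - alpha)
                           * (t - theta))
    return quad(integrand, -20, 20, limit=200)[0]

theta_hat = brentq(lhs, -3, 3)
print("generalized Hellinger estimate:", theta_hat)
print("sample mean (non-robust):      ", x.mean())
```

Even with 10% gross outliers, the resulting location estimate stays near the centre of the "true" component, whereas the sample mean is pulled towards the outliers.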

The following estimating equation, in which the score function is weighted by a power of the model density and equated to its hypothetical counterpart, was proposed by Basu et al. [5]:
$$\frac{1}{n}\sum_{j=1}^{n} p_\theta(X_j)^{\alpha-1}\, s(X_j;\theta) = \int p_\theta(x)^{\alpha}\, s(x;\theta)\,\mathrm{d}x, \tag{9}$$
where α>1. Motivated by the works of Field and Smith [29] and Windham [66], an alternative estimating equation, in which the weights are further normalized, was proposed by Jones et al. [36]:
$$\frac{\frac{1}{n}\sum_{j=1}^{n} p_\theta(X_j)^{\alpha-1}\, s(X_j;\theta)}{\frac{1}{n}\sum_{j=1}^{n} p_\theta(X_j)^{\alpha-1}} = \frac{\int p_\theta(x)^{\alpha}\, s(x;\theta)\,\mathrm{d}x}{\int p_\theta(x)^{\alpha}\,\mathrm{d}x}, \tag{10}$$
where α>1. Notice that (9) and (10) do not require the use of the empirical distribution. Hence no smoothing is required in these cases. The estimators of (8), (9), (10) are consistent and asymptotically normal [5, Theorem 2], [36, Section 3], [7, Theorem 3]. They also satisfy two invariance properties: one when the underlying model is re-parameterized by a one-to-one function of the parameter [5, Section 3.4], and the other when the samples are replaced by a linear transformation of them [59, Theorem 3.1], [5, Section 3.4]. They coincide with the ML-estimating equation (5) when α=1, under the condition that $\int p_\theta(x)\, s(x;\theta)\,\mathrm{d}x = 0$. The estimating equations (5), (8), (9) and (10) are associated, respectively, with the divergences in (4), (1), (2) and (3), in a sense that will be made clear in the following.
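
The following sketch (again assuming the N(θ,1) location model; this is not an example from the paper) solves (9) on a contaminated sample. For this particular model the right-hand sides of (9) and (10) vanish by symmetry, so the two equations share the same root:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
# Contaminated sample: mostly N(0,1), with a few outliers near 8.
x = np.concatenate([rng.normal(0, 1, 180), rng.normal(8, 1, 20)])

alpha = 2.0   # alpha > 1 for robustness in Eqs. (9) and (10)

def basu_lhs(theta):
    # For the N(theta,1) model, s(x;theta) = x - theta and the right-hand
    # sides of Eqs. (9) and (10) are zero by symmetry, so both equations
    # reduce to a weighted sum of (X_j - theta) being zero.
    w = np.exp(-(alpha - 1) * (x - theta)**2 / 2)   # p_theta(X_j)^(alpha-1), up to a constant
    return np.sum(w * (x - theta))

theta_hat = brentq(basu_lhs, -3, 3)
print("Basu/Jones estimate:", theta_hat)     # close to 0
print("sample mean:        ", x.mean())      # pulled towards the outliers
```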

Observe that the estimating equations (5), (8), (9) and (10) are implications of the first-order optimality condition for maximizing, respectively, the usual log-likelihood function
$$L(\theta) \triangleq \frac{1}{n}\sum_{j=1}^{n}\ln p_\theta(X_j), \tag{11}$$
and the following generalized likelihood functions:
$$L_1^{(\alpha)}(\theta) \triangleq \frac{1}{1-\alpha}\int \tilde{p}_n(x)^{\alpha}\, p_\theta(x)^{1-\alpha}\,\mathrm{d}x, \tag{12}$$
$$L_2^{(\alpha)}(\theta) \triangleq \frac{1}{n}\sum_{j=1}^{n}\frac{\alpha\, p_\theta(X_j)^{\alpha-1}-1}{\alpha-1} - \int p_\theta(x)^{\alpha}\,\mathrm{d}x, \tag{13}$$
$$L_3^{(\alpha)}(\theta) \triangleq \frac{\alpha}{\alpha-1}\ln\left[\frac{1}{n}\sum_{j=1}^{n} p_\theta(X_j)^{\alpha-1}\right] - \ln\int p_\theta(x)^{\alpha}\,\mathrm{d}x. \tag{14}$$
The likelihood functions (12), (13), (14) are not defined for α=1. However, it can be shown that they all coincide with L(θ) as α→1.

It is easy to see that the probability distribution $p_\theta$ that maximizes (12), (11), (13) or (14) is the same as, respectively, the one that minimizes $D_\alpha(\tilde{p}_n, p_\theta)$, or the empirical estimates of $I(p_\epsilon, p_\theta)$, $B_\alpha(p_\epsilon, p_\theta)$ or $\mathscr{I}_\alpha(p_\epsilon, p_\theta)$. Thus, for MLE or “robustified MLE”, one needs to solve
$$\inf_{p_\theta \in \Pi} D(\bar{p}_n, p_\theta), \tag{15}$$
where D is either I, $D_\alpha$, $B_\alpha$ or $\mathscr{I}_\alpha$; $\bar{p}_n = p_n$ when D is I, $B_\alpha$ or $\mathscr{I}_\alpha$, and $\bar{p}_n = \tilde{p}_n$ when D is $D_\alpha$. Notice that (8) for α>1, and (9), (10) for α<1, do not make sense in terms of robustness. However, they still serve as first-order optimality conditions for the divergence minimization problem (15). A probability distribution that attains the infimum is known as a reverse D-projection of $\bar{p}_n$ on Π.
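
To see the claimed equivalence between maximizing (13) and solving (9) in a concrete case, the sketch below (same assumed N(θ,1) model and contaminated data as in the previous snippet) maximizes $L_2^{(\alpha)}$ directly; the maximizer agrees with the root of (9):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 180), rng.normal(8, 1, 20)])
alpha = 2.0

def L2(theta):
    # Generalized likelihood of Eq. (13) for the N(theta,1) model.
    # For this model, int p_theta(x)^alpha dx = (2*pi)**((1-alpha)/2) / sqrt(alpha),
    # which is constant in theta and so does not affect the maximizer.
    dens = norm.pdf(x, theta, 1)
    const = (2 * np.pi) ** ((1 - alpha) / 2) / np.sqrt(alpha)
    return np.mean((alpha * dens**(alpha - 1) - 1) / (alpha - 1)) - const

res = minimize_scalar(lambda t: -L2(t), bounds=(-3, 3), method="bounded")
print("maximizer of L2:", res.x)   # agrees with the root of Eq. (9)
```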

A “dual” minimization problem is the so-called forward projection problem, where the minimization is over the first argument of the divergence. Given a set of probability distributions with support S and a probability distribution q with the same support, any p in this set that attains the infimum $\inf D(p,q)$, taken over all p in the set, is called a forward D-projection of q on the set. Forward projection is usually on a convex set or on an α-convex set of probability distributions. Forward projection on a convex set is motivated by the well-known maximum entropy principle of statistical physics [34]. Motivation for forward projection on an α-convex set comes from so-called non-extensive statistical physics [40], [61], [62], [63]. Forward I-projection on convex sets was extensively studied by Csiszár [18], [19], [22], Csiszár and Matúš [23], [24], Csiszár and Shields [25], and Csiszár and Tusnády [26].
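
Here is a minimal illustration of a forward I-projection on a linear family (the uniform q, the statistic f(x)=x and the moment value c=6 are arbitrary choices added here, not from the paper). The projection is known to take the exponential form $p^*(x)\propto q(x)e^{\theta f(x)}$, so finding it reduces to a one-dimensional root-finding problem in θ:

```python
import numpy as np
from scipy.optimize import brentq

# Forward I-projection of the uniform q on {0,...,9} onto the linear
# family {p : sum_x p(x) f(x) = c} with f(x) = x and c = 6.
x = np.arange(10)
q = np.full(10, 0.1)
f, c = x.astype(float), 6.0

def moment_gap(theta):
    # Exponential tilt of q; theta is chosen so the moment constraint holds.
    p = q * np.exp(theta * f)
    p /= p.sum()
    return p @ f - c

theta_star = brentq(moment_gap, -5, 5)
p_star = q * np.exp(theta_star * f); p_star /= p_star.sum()
print("theta* =", theta_star)
print("E_{p*}[f] =", p_star @ f)     # equals c = 6
```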

The forward projections of any of the divergences in (1)–(4) on convex (or α-convex) sets of probability distributions yield a parametric family of probability distributions. A reverse projection on this parametric family turns into a forward projection on the convex (or α-convex) set, which further reduces to solving a system of linear equations. We call such a result a projection theorem of the divergence. These projection theorems are mainly due to an “orthogonal” relationship between the convex (or α-convex) family and the associated parametric family. The Pythagorean theorem of the associated divergence plays a key role in this context.
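
For orientation, recall the classical Pythagorean identity in the I-divergence case (due to Csiszár): if $p^*$ is the forward I-projection of q on a linear family $\mathbb{L}$, then
$$I(p,q) = I(p,p^*) + I(p^*,q) \quad \text{for every } p\in\mathbb{L}.$$
Analogous identities for the divergences (1)–(3) are what underlie the projection theorems discussed below.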

The projection theorem for the I-divergence is due to Csiszár and Shields [25, p. 24], where the convex family is a linear family and the associated parametric family is an exponential family. The projection theorem for the $\mathscr{I}_\alpha$-divergence was established by Kumar and Sundaresan [41, Theorems 18 and 21], where the so-called α-power-law family (M(α)-family) plays the role of the exponential family. The projection theorem for the $D_\alpha$-divergence was established by Kumar and Sason [39, Theorem 6], where a variant of the α-power-law family, called the α-exponential family (E(α)-family), plays the role of the exponential family and the so-called α-linear family plays the role of the linear family. A projection theorem for the more general class of Bregman divergences, of which $B_\alpha$ is a subclass, was established by Csiszár and Matúš [24] using techniques from convex analysis (see also [50]). We observe that the parametric family associated with the projection theorem of the $B_\alpha$-divergence is closely related to the α-power-law family; we call it the B(α)-family.

Thus projection theorems enable us to find the estimator (MLE or any of the generalized estimators) as a forward projection when the estimation is carried out under a specific parametric family. While for MLE the required family is the exponential family, for the generalized estimators it is one of the power-law families.

Our main contributions in this paper are the following.

  • 1. The projection theorem for the $\mathscr{I}_\alpha$-divergence is known in the literature only for the discrete, canonical case. We first define the associated power-law family M(α) in a more general setup and establish the projection theorem for $\mathscr{I}_\alpha$ on M(α).

  • 2. We derive the projection theorem for the $D_\alpha$-divergence on the E(α)-family in more generality by establishing a one-to-one correspondence between this problem and the projection problem concerning the $\mathscr{I}_\alpha$-divergence on the M(α)-family.

  • 3. We introduce the concept of regularity (full-rank family) for the power-law families B(α), M(α) and E(α). We also establish a close relationship among them.

  • 4. We show that the Cauchy distributions (also known as q-Gaussian distributions [32], [46], [50], [53], [64]) are the escort distributions of the Student distributions [28], [35]. Moreover, the Cauchy and Student distributions, respectively, form regular E(α) and regular M(α) (and B(α)) families.

  • 5. We find generalized estimators for the location and scale parameters of the Student and Cauchy distributions using the projection theorems of the Jones et al. and Hellinger divergences. We also observe that these projection theorems cannot be applied when the distributions are compactly supported; in this case the estimators must be found on a case-by-case basis. We find estimators in one such case and compare them with the MLE.

The rest of the paper is organized as follows. In Section 2, we first generalize the power-law families to the continuous case and show that the Student and Cauchy distributions belong to this class. We also introduce the notion of regularity for these power-law families and establish the relationship among them. In Section 3, we establish projection theorems for the general power-law families. In Section 4, we apply the projection theorems to the Student and Cauchy distributions to find generalized estimators for their parameters. We also perform some simulations to analyze the efficacy of these estimators. We end the paper with a summary and concluding remarks in Section 5. In the supplementary article, we establish the projection theorem of the $B_\alpha$-divergence in the discrete case using elementary tools and identify the parametric family associated with this divergence. We also present detailed derivations of some of the results there.

Section snippets

The power-law families: definition and examples

In this section, we define the power-law families associated with the projection theorems of the divergences $B_\alpha$, $\mathscr{I}_\alpha$ and $D_\alpha$ in a more general set-up than that studied in the literature. We also introduce the concept of regularity for these families. In the literature such a notion has been studied for the exponential family, where it is sometimes referred to as the full-rank property (see [33], [42]). We then make a comparison among these families. We also show that the well-known Student and Cauchy

Projection theorems for general power-law families

In this section, we extend the projection theorems of the $B_\alpha$, $\mathscr{I}_\alpha$ and $D_\alpha$ divergences to the general power-law families by directly solving the associated estimating equations. We also find conditions under which the new projection theorems reduce to the ones in the canonical case. We shall begin by recalling the projection theorems known in the literature. In the following, assume that the families are canonical and regular with support S being finite and the parameter space Θ being the natural

Applications: Generalized estimation under Student and Cauchy distributions

In this section we find Jones et al. estimators [36] for the parameters of the Student distribution for $\nu\in(0,\infty)$ and generalized Hellinger estimators [6] of the Cauchy distribution for $\beta\in(1,(d+2)/2)$. For the estimation of Cauchy distributions we use the kernel density estimate of the empirical measure. We also find a robust estimator of the mean parameter of the Student distribution for the case when $\nu\in(0,\infty)$.

Summary and concluding remarks

The projection theorems of the Jones et al. ($\mathscr{I}_\alpha$) and Hellinger ($D_\alpha$) divergences tell us that the reverse projection on the power-law families M(α) and E(α), respectively, turns into a forward projection on a “simpler” (linear or α-linear) family which, in turn, reduces to a linear problem on the underlying probability distribution. The applicability of these projection theorems as known in the literature was limited, as they dealt only with discrete and canonical models. In this work, we first generalized

CRediT authorship contribution statement

Atin Gayen: Conceptualization, Methodology, Writing - original draft, Investigation, Software, Data curation, Validation, Funding acquisition. M. Ashok Kumar: Conceptualization, Methodology, Writing - review & editing, Supervision, Validation.

Acknowledgments

Atin Gayen is supported by an INSPIRE fellowship of the Department of Science and Technology, Govt. of India. Part of this work was carried out when the authors were with the Indian Institute of Technology Indore. The authors would like to thank Professor Arup Bose for his constructive comments. The authors would also like to thank Professor Michel Broniatowski for the discussions they had with him during his visit to India through the VAJRA programme of the Govt. of India. The authors also thank

References (66)

  • Basu, A., et al., Robust minimum divergence procedures for count data models, Sankhya: Indian J. Stat. (1997).
  • Basu, A., et al., Robust and efficient estimation by minimizing a density power divergence, Biometrika (1998).
  • Basu, A., et al.
  • Beran, R., Minimum Hellinger distance estimates for parametric models, Ann. Statist. (1977).
  • Bhattacharyya, A., On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc. (1943).
  • Broniatowski, M., et al., Several applications of divergence criteria in continuous families, Kybernetika (Prague) (2012).
  • Brown, L.D., Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory (1986).
  • Bunte, C., et al., Encoding tasks and Rényi entropy, IEEE Trans. Inform. Theory (2014).
  • Cao, R., et al., Minimum distance density-based estimation, Comput. Statist. Data Anal. (1995).
  • Cichocki, A., et al., Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities, Entropy (2010).
  • Cressie, N., et al., Multinomial goodness-of-fit tests, J. R. Stat. Soc. Ser. B Stat. Methodol. (1984).
  • Csiszár, I., I-divergence geometry of probability distributions and minimization problems, Ann. Probab. (1975).
  • Csiszár, I., Sanov property, generalized I-projection and a conditional limit theorem, Ann. Probab. (1984).
  • Csiszár, I., Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems, Ann. Statist. (1991).
  • Csiszár, I., Generalized cutoff rates and Rényi's information measures, IEEE Trans. Inform. Theory (1995).
  • Csiszár, I., Generalized projections for non-negative functions, Acta Math. Hungar. (1995).
  • Csiszár, I., et al., Information projections revisited, IEEE Trans. Inform. Theory (2003).
  • Csiszár, I., et al., Generalized minimizers of convex integral functionals, Bregman distance, Pythagorean identities, Kybernetika (Prague) (2012).
  • Csiszár, I., et al., Information Theory and Statistics: A Tutorial (2004).
  • Csiszár, I., et al., Information geometry and alternating minimization procedures, Stat. Decis. (1984).
  • Dempster, A.P., et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Stat. Methodol. (1977).
  • Eguchi, S., et al., Projective power entropy and maximum Tsallis entropy distributions, Entropy (2011).
  • Field, C., et al., Robust estimation: A weighted maximum likelihood approach, Int. Stat. Rev. (1994).