Projection theorems and estimating equations for power-law models
Introduction
A divergence $D(p,q)$ is a non-negative extended real-valued function defined for any pair of probability distributions and satisfying $D(p,q) = 0$ if and only if $p = q$. Minimum divergence (or distance) methods are popular in statistical inference because of their many desirable properties, including robustness and efficiency [6], [51]. Minimization of the information divergence ($I$-divergence) or relative entropy is closely related to maximum likelihood estimation (MLE) [25, Lem. 3.1]. MLE is not a preferred method when the data are contaminated by outliers. However, the $I$-divergence can be extended by replacing the logarithmic function by a power function to produce divergences that are robust to outliers [5], [16], [36]. In this paper, we consider three such families of divergences that are well-known in the context of robust statistics. They are defined as follows.
Let $p$ and $q$ be probability distributions having a common support $\mathbb{S}$. Let $\alpha > 0$, $\alpha \neq 1$.
- (a)
The Hellinger divergence (also known as the Cressie–Read power divergence [17] or power divergence [52] and, up to a monotone function, the same as the Rényi divergence [55]):
$$H_\alpha(p,q) = \frac{1}{\alpha-1}\left(\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu - 1\right). \qquad (1)$$
- (b)
The Basu et al. divergence (also known as the power pseudo-distance [11], [12], density power divergence [5], [37], [49], or $\beta$-divergence [47]):
$$B_\alpha(p,q) = \int q^\alpha\,\mathrm{d}\mu - \frac{\alpha}{\alpha-1}\int p\,q^{\alpha-1}\,\mathrm{d}\mu + \frac{1}{\alpha-1}\int p^\alpha\,\mathrm{d}\mu. \qquad (2)$$
- (c)
The Jones et al. divergence [30], [44], [57], [58] (also known as the relative $\alpha$-entropy [40], [41], Rényi pseudo-distance [11], [12], logarithmic density power divergence [45], projective power divergence [28], or $\gamma$-divergence [16], [30]):
$$J_\alpha(p,q) = \frac{1}{\alpha(\alpha-1)}\log\int p^\alpha\,\mathrm{d}\mu - \frac{1}{\alpha-1}\log\int p\,q^{\alpha-1}\,\mathrm{d}\mu + \frac{1}{\alpha}\log\int q^\alpha\,\mathrm{d}\mu. \qquad (3)$$
Throughout the paper we assume that all the integrals are well defined over $\mathbb{S}$. The integrals are with respect to the Lebesgue measure $\mu$ in the continuous case and with respect to the counting measure in the discrete case. Many well-known divergences fall in the above classes. For example, the chi-square divergence, the Bhattacharyya distance [8] and the Hellinger distance [7] fall in the Hellinger divergence class; the Cauchy–Schwarz divergence [54, Eq. (2.90)] falls in the Jones et al. divergence class; the squared Euclidean distance falls in the Basu et al. divergence class [5]. All three classes of divergences coincide with the $I$-divergence as $\alpha \to 1$ [16], where
$$I(p\|q) = \int p\log\frac{p}{q}\,\mathrm{d}\mu. \qquad (4)$$
In this sense, each of these three classes of divergences can be regarded as a generalization of the $I$-divergence.
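For discrete distributions, the three divergence classes can be explored numerically. The following Python sketch (ours, using one standard parameterization of each divergence; the function names are our own) checks that all three tend to the $I$-divergence as $\alpha \to 1$:

```python
import numpy as np

# One standard parameterization of each divergence (names are ours);
# p, q are probability vectors on a common finite support, a > 0, a != 1.

def hellinger_div(p, q, a):
    # Hellinger (power) divergence of order a
    return (np.sum(p**a * q**(1 - a)) - 1) / (a - 1)

def basu_div(p, q, a):
    # Basu et al. (density power type) divergence
    return (np.sum(q**a)
            - a / (a - 1) * np.sum(p * q**(a - 1))
            + np.sum(p**a) / (a - 1))

def jones_div(p, q, a):
    # Jones et al. (gamma type) divergence
    return (np.log(np.sum(p**a)) / (a * (a - 1))
            - np.log(np.sum(p * q**(a - 1))) / (a - 1)
            + np.log(np.sum(q**a)) / a)

def i_div(p, q):
    # I-divergence (relative entropy)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# All three approach the I-divergence as a -> 1.
for div in (hellinger_div, basu_div, jones_div):
    assert abs(div(p, q, 1.0001) - i_div(p, q)) < 1e-3
```

In this parameterization, `basu_div` with $a = 2$ reduces exactly to the squared Euclidean distance $\sum_x (p(x)-q(x))^2$, and `jones_div` with $a = 2$ to the Cauchy–Schwarz divergence, consistent with the class memberships noted above.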
Hellinger divergences also arise as generalized cut-off rates in information theory [21]. Basu et al. divergences belong to the Bregman class, which is characterized by transitive projection rules [20, Eq. (3.2), Theorem 3], [37, Example 3]. Jones et al. divergences (for $\alpha > 1$) arise in information theory as redundancy measures in the mismatched cases of guessing [57], source coding [41] and encoding of tasks [14]. The three classes of divergences are closely related to robust estimation, for $\alpha > 1$ in the case of the Basu et al. and Jones et al. divergences and for $\alpha < 1$ in the case of the Hellinger divergence, as we shall see now.
Let $X_1, \dots, X_n$ be an independent and identically distributed (i.i.d.) sample drawn from an unknown distribution $p$. Let us suppose that $p$ is a member of a parametric family of probability distributions $\Pi = \{p_\theta : \theta \in \Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$ and all $p_\theta$ have a common support $\mathbb{S}$. MLE picks the distribution that would have most likely caused the sample. MLE solves the so-called score equation or estimating equation for $\theta$, given by
$$\frac{1}{n}\sum_{i=1}^n s(X_i;\theta) = 0, \qquad (5)$$
where $s(x;\theta) := \nabla\log p_\theta(x)$ is called the score function and $\nabla$ stands for gradient with respect to $\theta$. In the discrete case, the above equation can be re-written as
$$\sum_x p_n(x)\,\nabla\log p_\theta(x) = 0, \qquad (6)$$
where $p_n$ is the empirical measure of the sample $X_1, \dots, X_n$.
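As a toy illustration of the score equation (our example, not taken from the paper), for the exponential model $p_\theta(x) = \theta e^{-\theta x}$ the score is $s(x;\theta) = 1/\theta - x$, and the estimating equation solves in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # true theta = 1/scale = 0.5

# For p_theta(x) = theta * exp(-theta * x), the score function is
#   s(x; theta) = d/dtheta log p_theta(x) = 1/theta - x,
# so the estimating equation (1/n) * sum_i s(X_i; theta) = 0 gives
# theta_hat = 1 / mean(X) in closed form.
theta_hat = 1.0 / x.mean()
assert abs(theta_hat - 0.5) < 0.01
```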
Let us now suppose that the sample is from a mixture distribution of the form $p = (1-\epsilon)\pi + \epsilon h$, $\epsilon \in [0,1)$, where $\pi$ is supposed to be a member of $\Pi$; $\pi$ is regarded as the distribution of “true” samples and $h$, that of outliers. Assume that the support of $h$ is a subset of $\mathbb{S}$. While the usual MLE tries to fit a distribution for $p$, robust estimation tries to fit one for $\pi$. Throughout the paper, the above will be the setup in all the estimation problems, unless otherwise stated. Thus for robust estimation, one needs to modify the estimating equation so that the effect of outliers is down-weighted. The following modified estimating equation, referred to as the generalized Hellinger estimating equation, was proposed in [4], where the score function is weighted by $p_n(x)^\alpha p_\theta(x)^{1-\alpha}$ instead of $p_n(x)$ as in (6):
$$\sum_x p_n(x)^\alpha\,p_\theta(x)^{1-\alpha}\,\nabla\log p_\theta(x) = 0, \qquad (7)$$
where $\alpha > 0$, $\alpha \neq 1$. This was proposed based on the following intuition. If $x$ is an outlier, then $p_\theta(x)$ will be smaller than $p_n(x)$, so the relative weight $(p_\theta(x)/p_n(x))^{1-\alpha}$ becomes small for sufficiently small values of $\alpha$. Hence the terms corresponding to outliers in (7) are down-weighted (c.f. [6, Section 4.3] and the references therein).
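The down-weighting intuition can be checked with a small numerical sketch (our toy numbers): relative to the ML equation (6), the generalized Hellinger equation multiplies the score term at a cell $x$ by $(p_\theta(x)/p_n(x))^{1-\alpha}$, which shrinks at outlier cells as $\alpha$ decreases below 1:

```python
# Relative weight on the score term at a cell x, compared with the
# ML equation: w(x) = (p_theta(x) / p_n(x)) ** (1 - alpha).
p_n_x = 0.05       # empirical mass at an outlier cell
p_theta_x = 1e-6   # model assigns almost no mass there

weights = [(p_theta_x / p_n_x) ** (1.0 - a) for a in (0.9, 0.5, 0.1)]

# The outlier cell is suppressed more and more as alpha decreases below 1.
assert weights[0] > weights[1] > weights[2]
assert weights[2] < 1e-3
```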
Notice that (7) does not extend to the continuous case due to the appearance of $p_n$. In the literature, to avoid this technical difficulty, smoothing techniques such as kernel density estimation [7, Section 3], [6, Sections 3.1, 3.2.1], the Basu–Lindsay approach [6, Section 3.5], the modified approach of Cao et al. [15], and so on, are used to obtain a continuous estimate of $p$. The resulting estimating equation is of the form
$$\int \hat{p}_n(x)^\alpha\,p_\theta(x)^{1-\alpha}\,\nabla\log p_\theta(x)\,\mathrm{d}x = 0, \qquad (8)$$
where $\hat{p}_n$ is some continuous estimate of $p$. To avoid this smoothing, Broniatowski et al. derived a duality technique, where one first finds a dual representation of the Hellinger distance and then minimizes the empirical estimate of this dual representation to find the estimator. The empirical estimate of the dual representation does not require any smoothing. See [9], [10], [11], [12], [48], [60] for details.
The following estimating equation, where the score function is weighted by a power of the model density and equated to its hypothetical counterpart, was proposed by Basu et al. [5]:
$$\frac{1}{n}\sum_{i=1}^n p_\theta(X_i)^{\alpha-1}\,s(X_i;\theta) = \int p_\theta(x)^\alpha\,s(x;\theta)\,\mathrm{d}x, \qquad (9)$$
where $\alpha > 1$. Motivated by the works of Field and Smith [29] and Windham [66], an alternative estimating equation, where the weights are further normalized, was proposed by Jones et al. [36]:
$$\frac{\sum_{i=1}^n p_\theta(X_i)^{\alpha-1}\,s(X_i;\theta)}{\sum_{i=1}^n p_\theta(X_i)^{\alpha-1}} = \frac{\int p_\theta(x)^\alpha\,s(x;\theta)\,\mathrm{d}x}{\int p_\theta(x)^\alpha\,\mathrm{d}x}, \qquad (10)$$
where $\alpha > 1$. Notice that (9) and (10) do not require the use of the empirical distribution. Hence no smoothing is required in these cases. The estimators of (8), (9), (10) are consistent and asymptotically normal [5, Theorem 2], [36, Section 3], [7, Theorem 3]. They also satisfy two invariance properties: one when the underlying model is re-parameterized by a one-one function of the parameter [5, Section 3.4], and the other when the samples are replaced by a common linear transformation of themselves [59, Theorem 3.1], [5, Section 3.4]. They coincide with the ML-estimating equation (5) when $\alpha = 1$, under the condition that $\int \nabla p_\theta(x)\,\mathrm{d}x = 0$. The estimating equations (5), (8), (9), (10) are, respectively, associated with the divergences in (4), (1), (2), and (3) in a sense that will be made clear in the following.
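To see (9) in action, here is a sketch (our construction, under the assumption of a normal location model with known scale, where the right-hand side of (9) vanishes by symmetry) that solves the Basu et al. estimating equation by fixed-point iteration and compares it with the sample mean on contaminated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 950),   # "true" samples
                    rng.normal(8.0, 1.0, 50)])   # outliers

def basu_location(x, alpha, sigma=1.0, iters=100):
    # For N(mu, sigma^2) with known sigma, equation (9) reduces to
    #   sum_i w_i * (x_i - mu) = 0,  w_i = p_mu(x_i)**(alpha - 1),
    # since the model-side integral vanishes by symmetry.  The
    # normalizing constant of the density cancels in the weighted mean.
    mu = np.median(x)  # robust starting point
    for _ in range(iters):
        w = np.exp(-0.5 * (alpha - 1.0) * ((x - mu) / sigma) ** 2)
        mu = np.sum(w * x) / np.sum(w)
    return mu

mle = x.mean()                        # pulled toward the outliers
robust = basu_location(x, alpha=2.0)  # outliers receive near-zero weight
assert abs(robust) < abs(mle)
```

For this symmetric location model the normalized equation (10) leads to the same fixed point, since the normalization affects both sides of the equation equally.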
Observe that the estimating equations (5), (8), (9), and (10) are implications of the first order optimality conditions of maximizing, respectively, the usual log-likelihood function
$$L(\theta) = \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) \qquad (11)$$
and the following generalized likelihood functions:
$$L^{(H)}_\alpha(\theta) = \frac{1}{1-\alpha}\log\int \hat{p}_n(x)^\alpha\,p_\theta(x)^{1-\alpha}\,\mathrm{d}x, \qquad (12)$$
$$L^{(B)}_\alpha(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{p_\theta(X_i)^{\alpha-1}-1}{\alpha-1} - \frac{1}{\alpha}\left(\int p_\theta(x)^\alpha\,\mathrm{d}x - 1\right), \qquad (13)$$
$$L^{(J)}_\alpha(\theta) = \frac{1}{\alpha-1}\log\left(\frac{1}{n}\sum_{i=1}^n p_\theta(X_i)^{\alpha-1}\right) - \frac{1}{\alpha}\log\int p_\theta(x)^\alpha\,\mathrm{d}x. \qquad (14)$$
The above likelihood functions (12), (13), (14) are not defined for $\alpha = 1$. However, it can be shown that they all coincide with (11), up to terms not depending on $\theta$, as $\alpha \to 1$.
It is easy to see that the probability distribution that maximizes (12), (11), (13) or (14) is the same as, respectively, the one that minimizes $H_\alpha(\hat{p}_n, p_\theta)$, $I(p_n\|p_\theta)$, or the empirical estimates of $B_\alpha(p, p_\theta)$ or $J_\alpha(p, p_\theta)$. Thus for MLE or “robustified MLE”, one needs to solve
$$\inf_{\theta \in \Theta} D(\hat{p}, p_\theta), \qquad (15)$$
where $D$ is either $I$, $H_\alpha$, $B_\alpha$ or $J_\alpha$ (the $I$-divergence or the Hellinger, Basu et al. or Jones et al. divergence); $\hat{p}$ is the empirical measure $p_n$ (or a continuous estimate of $p$) when $D$ is $I$ or $H_\alpha$, and $\hat{p}$ is $p$ itself, with the objective replaced by its empirical estimate, when $D$ is $B_\alpha$ or $J_\alpha$. Notice that (8) for $\alpha > 1$, and (9), (10) for $\alpha < 1$, do not make sense in terms of robustness. However, they still serve as first order optimality conditions for the divergence minimization problem (15). A probability distribution that attains the infimum in (15) is known as a reverse $D$-projection of $\hat{p}$ on $\Pi$.
A “dual” minimization problem is the so-called forward projection problem, where the minimization is over the first argument of the divergence function. Given a set $\mathbb{E}$ of probability distributions with support $\mathbb{S}$ and a probability distribution $q$ with the same support, any $p^* \in \mathbb{E}$ that attains
$$\inf_{p \in \mathbb{E}} D(p, q)$$
is called a forward $D$-projection of $q$ on $\mathbb{E}$. Forward projection is usually on a convex set or on an $\alpha$-convex set of probability distributions. Forward projection on a convex set is motivated by the well-known maximum entropy principle of statistical physics [34]. Motivation for forward projection on $\alpha$-convex sets comes from so-called non-extensive statistical physics [40], [61], [62], [63]. Forward $I$-projection on convex sets was extensively studied by Csiszár [18], [19], [22], Csiszár and Matúš [23], [24], Csiszár and Shields [25], and Csiszár and Tusnády [26].
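As a minimal illustration of forward projection (our sketch), the forward $I$-projection of a distribution $q$ on the linear family $\{p : \sum_x p(x)f(x) = c\}$ is an exponential tilt $p^*(x) \propto q(x)e^{\lambda f(x)}$, with $\lambda$ chosen to satisfy the constraint; a bisection search suffices:

```python
import numpy as np

q = np.array([0.25, 0.25, 0.25, 0.25])  # distribution to be projected
f = np.array([0.0, 1.0, 2.0, 3.0])      # moment function
c = 2.0                                 # linear constraint: E_p[f] = c

def tilt(lam):
    # Exponential tilting of q along f, normalized to a distribution.
    w = q * np.exp(lam * f)
    return w / w.sum()

# E_{tilt(lam)}[f] is increasing in lam, so bisection finds lam*.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if tilt(mid) @ f < c else (lo, mid)
p_star = tilt(0.5 * (lo + hi))

assert abs(p_star @ f - c) < 1e-6  # constraint met by the projection
```

Since $q$ is uniform here, the projection is also the maximum-entropy distribution under the moment constraint, matching the maximum entropy motivation mentioned above.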
The forward projection of any of the divergences in (1)–(4) on a convex (or $\alpha$-convex) set of probability distributions yields a parametric family of probability distributions. A reverse projection on this parametric family turns into a forward projection on the convex (or $\alpha$-convex) set, which further reduces to solving a system of linear equations. We call such a result a projection theorem of the divergence. These projection theorems are mainly due to an “orthogonality” relationship between the convex (or $\alpha$-convex) family and the associated parametric family. The Pythagorean theorem of the associated divergence plays a key role in this context.
The projection theorem of the $I$-divergence is due to Csiszár and Shields [25, pp. 24], where the convex family is a linear family and the associated parametric family is an exponential family. The projection theorem for the Jones et al. divergence was established by Kumar and Sundaresan [41, Theorems 18 and 21], where the so-called $\alpha$-power-law family plays the role of the exponential family. The projection theorem for the Hellinger divergence was established by Kumar and Sason [39, Theorem 6], where a variant of the $\alpha$-power-law family plays the role of the exponential family and the so-called $\alpha$-linear family plays the role of the linear family. The projection theorem for the more general class of Bregman divergences, of which the Basu et al. class is a subclass, was established by Csiszár and Matúš [24] using techniques from convex analysis. (See also [50].) We observe that the parametric family associated with the projection theorem of the Basu et al. divergence is also closely related to the $\alpha$-power-law family.
Thus projection theorems enable us to find the estimator (the MLE or any of the generalized estimators) as a forward projection whenever the estimation is done under a specific parametric family. While for MLE the required family is exponential, for the generalized estimators it is one of the power-law families.
Our main contributions in this paper are the following.
- 1.
The projection theorem for the Jones et al. divergence is known in the literature only for the discrete, canonical case. We first define the associated power-law family in a more general setup and establish the projection theorem for this divergence on that family.
- 2.
We derive the projection theorem for the Hellinger divergence in more generality by establishing a one-to-one correspondence between this problem and the projection problem concerning the Jones et al. divergence on the $\alpha$-power-law family.
- 3.
We introduce the concept of regularity (full-rank family) for the power-law families associated with the Hellinger, Basu et al. and Jones et al. divergences. We also establish a close relationship among these families.
- 4.
We show that the Cauchy distributions (also known as $q$-Gaussian distributions [32], [46], [50], [53], [64]) are the escort distributions of the Student distributions [28], [35]. Moreover, the Cauchy and Student distributions, respectively, form regular instances of these power-law families.
- 5.
We find some generalized estimators for the location and scale parameters of the Student and Cauchy distributions using the projection theorems of the Jones et al. and Hellinger divergences. We also observe that these projection theorems cannot be applied when the distributions are compactly supported. In that case the estimators should be found on a case-by-case basis. We find estimators in one such case and compare them with the MLE.
The rest of the paper is organized as follows. In Section 2, we first generalize the power-law families to the continuous case and show that the Student and Cauchy distributions belong to this class. We also introduce the notion of regularity for these power-law families and establish the relationship among them. In Section 3, we establish projection theorems for the general power-law families. In Section 4, we apply the projection theorems to the Student and Cauchy distributions to find generalized estimators for their parameters. We also perform some simulations to analyze the efficacy of these estimators. We end the paper with a summary and concluding remarks in Section 5. In the supplementary article, we establish the projection theorem of the Basu et al. divergence in the discrete case using elementary tools and identify the parametric family associated with this divergence. We also present there the detailed derivations of some of the results of this paper.
The power-law families: definition and examples
In this section, we define the power-law families associated with the projection theorems of the Hellinger, Basu et al. and Jones et al. divergences in a more general set-up than that in which they have been studied in the literature. We also introduce the concept of regularity for these families. In the literature such a notion has been studied for the exponential family, where it is sometimes referred to as a full-rank family (see [33], [42]). We then make a comparison among these families. We also show that the well-known Student and Cauchy
Projection theorems for general power-law families
In this section, we extend the projection theorems of the Hellinger, Basu et al. and Jones et al. divergences to the general power-law families by directly solving the associated estimating equations. We also find conditions under which the new projection theorems reduce to the ones in the canonical case. We begin by recalling the projection theorems known in the literature. In the following, assume that the families are canonical and regular, with finite support and with the parameter space being the natural
Applications: Generalized estimation under student and Cauchy distributions
In this section we find Jones et al. estimators [36] for the parameters of the Student distribution and generalized Hellinger estimators [6] for those of the Cauchy distribution. For the estimation of the Cauchy distributions we use a kernel density estimate in place of the empirical measure. We also find a robust estimator of the mean parameter of the Student distribution in a case where the projection theorems cannot be applied.
Summary and concluding remarks
Projection theorems of the Jones et al. and Hellinger divergences tell us that the reverse projection on the associated power-law families turns out to be a forward projection on a “simpler” (linear or $\alpha$-linear) family which, in turn, reduces to a linear problem in the underlying probability distribution. The applicability of the projection theorems known in the literature was limited, as they dealt only with discrete and canonical models. In this work, we first generalized
CRediT authorship contribution statement
Atin Gayen: Conceptualization, Methodology, Writing - original draft, Investigation, Software, Data curation, Validation, Funding acquisition. M. Ashok Kumar: Conceptualization, Methodology, Writing - review & editing, Supervision, Validation.
Acknowledgments
Atin Gayen is supported by an INSPIRE fellowship of the Department of Science and Technology, Govt. of India. Part of this work was carried out when the authors were with the Indian Institute of Technology Indore. The authors would like to thank Professor Arup Bose for his constructive comments. The authors would also like to thank Professor Michel Broniatowski for the discussions they had with him during his visit to India through the VAJRA programme of the Govt. of India. The authors also thank
References (66)
- On maximum entropy principle, superstatistics, power-law distribution and Rényi parameter, Phys. A (2004).
- Minimum divergence estimators, maximum likelihood and exponential families, Statist. Probab. Lett. (2014).
- Parametric estimation and tests through divergences and the duality technique, J. Multivariate Anal. (2009).
- Decomposable pseudo-distances and applications in statistical estimation, J. Statist. Plann. Inference (2012).
- Robust parameter estimation with a small bias against heavy contamination, J. Multivariate Anal. (2008).
- Some results concerning maximum Rényi entropy distributions, Ann. Inst. Henri Poincaré Probab. Stat. (2007).
- Dual divergence estimators and tests: robustness results, J. Multivariate Anal. (2011).
- The role of constraints within generalized non-extensive statistics, Phys. A (1998).
- On multivariate truncated generalized Cauchy distribution, Statist. Papers (2013).
- Evaluation of the maximum-likelihood estimator where the likelihood equation has multiple roots, Biometrika (1966).
- Robust minimum divergence procedures for count data models, Sankhyā: Indian J. Stat.
- Robust and efficient estimation by minimizing a density power divergence, Biometrika.
- Minimum Hellinger distance estimates for parametric models, Ann. Statist.
- On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc.
- Several applications of divergence criteria in continuous families, Kybernetika (Prague).
- Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory.
- Encoding tasks and Rényi entropy, IEEE Trans. Inform. Theory.
- Minimum distance density-based estimation, Comput. Statist. Data Anal.
- Families of alpha-, beta- and gamma-divergences: flexible and robust measures of similarities, Entropy.
- Multinomial goodness-of-fit tests, J. R. Stat. Soc. Ser. B. Stat. Methodol.
- I-divergence geometry of probability distributions and minimization problems, Ann. Probab.
- Sanov property, generalized I-projection and a conditional limit theorem, Ann. Probab.
- Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems, Ann. Statist.
- Generalized cutoff rates and Rényi's information measures, IEEE Trans. Inform. Theory.
- Generalized projections for non-negative functions, Acta Math. Hungar.
- Information projections revisited, IEEE Trans. Inform. Theory.
- Generalized minimizers of convex integral functionals, Bregman distance, Pythagorean identities, Kybernetika (Prague).
- Information Theory and Statistics: A Tutorial.
- Information geometry and alternating minimization procedures, Stat. Decis.
- Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B. Stat. Methodol.
- Projective power entropy and maximum Tsallis entropy distributions, Entropy.
- Robust estimation: a weighted maximum likelihood approach, Int. Stat. Rev.