
Statistical Estimation of Mutual Information for Mixed Model

Abstract

Asymptotic unbiasedness and L2-consistency are established for various statistical estimates of mutual information in the mixed models framework. Such models are important, e.g., for the analysis of medical and biological data. The study of conditional Shannon entropy, as well as new results devoted to statistical estimation of differential Shannon entropy, is employed essentially. The theoretical results are complemented by computer simulations for a logistic regression model with different parameters. The numerical experiments demonstrate that the new statistics proposed by the authors have certain advantages.
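
The following Python sketch is only an illustration of the quantities discussed above, not the statistics proposed in the paper: it estimates I(X, Y) for continuous X and discrete Y via the decomposition I(X, Y) = H(X) − Σ_y P(Y = y) H(X | Y = y), plugging in Kozachenko–Leonenko nearest-neighbour entropy estimates; the Gaussian design, the logistic parameters w, b and the sample size are arbitrary assumptions.

```python
# Illustrative sketch only -- not the estimators proposed in the paper.
# I(X, Y) = H(X) - sum_y P(Y = y) H(X | Y = y) for continuous X and discrete Y,
# with differential entropies estimated by the Kozachenko-Leonenko kNN method.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=5):
    """k-nearest-neighbour (Kozachenko-Leonenko type) entropy estimate; x has shape (n, d)."""
    n, d = x.shape
    dist = cKDTree(x).query(x, k=k + 1)[0][:, -1]          # distance to the k-th neighbour
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of the unit ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(dist + 1e-300))

def mixed_mi(x, y, k=5):
    """Plug-in estimate of I(X, Y) for continuous X and discrete Y."""
    h_cond = sum((y == c).mean() * kl_entropy(x[y == c], k) for c in np.unique(y))
    return kl_entropy(x, k) - h_cond

rng = np.random.default_rng(0)
n, d = 5000, 3
w, b = np.array([1.0, -1.0, 0.5]), 0.2          # assumed logistic regression parameters
x = rng.normal(size=(n, d))                     # assumed design: X ~ N(0, I_d)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(x @ w + b)))).astype(int)
print("estimated I(X, Y) in nats:", mixed_mi(x, y))
```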

References

  • Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343

  • Bennasar M, Hicks Y, Setchi R (2014) Feature selection using joint mutual information maximisation. Expert Syst Appl 42:8520–8532

  • Berrett TB, Samworth RJ, Yuan M (2016) Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv:1606.00304

  • Biau G, Devroye L (2015) Lectures on the nearest neighbor method. Springer-Verlag, New York

  • Bulinski A, Dimitrov D (2019a) Statistical estimation of the Kullback–Leibler divergence. arXiv:1907.00196

  • Bulinski A, Dimitrov D (2019b) Statistical estimation of the Shannon entropy. Acta Math Sinica English Series 35:17–46

  • Bulinski A, Kozhevin A (2017) Modification of the MDR-EFE method for stratified samples. Stat, Optim Inf Comput 5:1–18

  • Bulinski A, Kozhevin A (2018) Statistical estimation of conditional Shannon entropy. ESAIM: Probability and Statistics pp 1–35, published online: November 28. https://doi.org/10.1051/ps/2018026

  • Coelho F, Braga A, Verleysen M (2016) A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int J Comput Intell Syst 9(4):726–733

  • Delattre S, Fournier N (2017) On the Kozachenko-Leonenko entropy estimator. Journal of Statistical Planning and Inference 185

  • Doquire G, Verleysen M (2012) A comparison of mutual information estimators for feature selection. In: Proc. of the 1st international conference on pattern recognition applications and methods, pp 176–185. https://doi.org/10.5220/0003726101760185

  • Favati P, Lotti G, Romani F (1991) Algorithm 691: Improving QUADPACK automatic integration routines. ACM Trans Math Softw 17(2):218–232

  • Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 4:1531–1555

  • Gao W, Kannan S, Oh S, Viswanath P (2017) Estimating mutual information for discrete-continuous mixtures. In: 31st conference on neural information processing systems (NIPS), Long Beach, CA, USA, pp 1–12

  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Amer Statist Assoc 58(301):13–30

  • Kozachenko L, Leonenko N (1987) Sample estimate of the entropy of a random vector. Probl Inf Transm 23:95–101

  • Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066138

  • Macedo F, Oliveira R, Pacheco A, Valadas R (2019) Theoretical foundations of forward feature selection methods based on mutual information. Neurocomputing 325:67–89

  • Massaron L, Boschetti A (2016) Regression analysis with Python. Packt Publishing Ltd., Birmingham

  • Nair C, Prabhakar B, Shah D (2007) Entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075v2

  • Novovičová J, Somol P, Haindl M, Pudil P (2007) Conditional mutual information based feature selection for classification task. In: Rueda L, Mery D, Kittler J (eds) CIARP 2007, LNCS, vol 4756. Springer-Verlag, Berlin, pp 417–426

  • Paninski L (2003) Estimation of entropy and mutual information. Neural Comput. 15:1191–1253

  • Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238

  • Vergara J, Estévez P (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24:175–186. https://doi.org/10.1007/s00521-013-1368-0

  • Verleysen M, Rossi F, François D (2009) Advances in feature selection with mutual information. arXiv:0909.0635v1

  • Yeh J (2014) Real analysis: Theory of measure and integration, 3rd edn. World Scientific, Singapore

Acknowledgements

The work of A. Bulinski is supported by the Russian Science Foundation under grant 19-11-00290 and performed at the Steklov Mathematical Institute of the Russian Academy of Sciences. The theoretical results were established in the joint work of A. Bulinski and A. Kozhevin; the simulations were carried out by A. Kozhevin. The authors are grateful to the Reviewer for the remarks concerning the book by G. Biau and L. Devroye.

Author information

Corresponding author

Correspondence to Alexander Bulinski.

Appendix

Proof of Lemma 1

Eq. 21 becomes evident if we consider the Bernoulli scheme with success probability P(Y = y). One has \(\{N_{y,n} = 0\} = \{Y_{q} \ne y,\ q = 1,\ldots,n\}\), and

$$ \begin{aligned} \{N_{y,n} = k\} &= \cup_{I} \left\{ {j_{1}^{y}} = i_{1}, \ldots, {j_{k}^{y}} = i_{k} \right\}\\ &= \cup_{I} \left( \{ Y_{i} = y, i \in I\} \cap \{Y_{q} \ne y, q \in \{1,\ldots,n\}\setminus I \}\right), \end{aligned} $$

for \(k = 1,\ldots,n\), where \(\cup_{I}\) is taken over \(I = \{i_{1},\ldots,i_{k}\}\) such that \(1 \leq i_{1} < {\ldots} < i_{k} \leq n\).
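
A quick Monte Carlo check of the binomial law for N_{y,n} invoked via Eq. 21 can be made by simulating the Bernoulli scheme; the values of n and P(Y = y) below are arbitrary assumptions:

```python
# Sanity check: N_{y,n} should follow Binomial(n, P(Y = y)); the parameters are assumptions.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
n, p, trials = 20, 0.3, 200_000
counts = (rng.uniform(size=(trials, n)) < p).sum(axis=1)   # realizations of N_{y,n}
for k in (0, 3, 6, 10):
    print(k, round(float(np.mean(counts == k)), 4), round(float(binom.pmf(k, n, p)), 4))
```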

[Fig. 3: Left: n = 100. Right: n = 500]

[Fig. 4: Left: n = 100. Right: n = 500]

Thus, for \(m\in \mathbb {N}\), m > k, and any \(B_{i} \in {\mathscr{B}}(\mathbb {R}^{d})\), i = 1,…,m,

$$ \begin{aligned} &{\textsf{P}}\left( X_{{j_{1}^{y}}} \in B_{1}, \ldots, X_{{j_{m}^{y}}} \in B_{m}, N_{y,n} = k\right) \\ &\qquad= \sum\limits_{\substack{ J = \{i_{1}, \ldots, i_{k}, i_{k+1}, \ldots, i_{m} \} \colon \\ 1 \leq i_{1} < {\ldots} < i_{k} \leq n < i_{k+1} < {\ldots} <i_{m} }} {\textsf{P}}\left( X_{i_{1}} \in B_{1}, \ldots, X_{i_{m}} \in B_{m}, {j_{1}^{y}} = i_{1}, \ldots, {j_{m}^{y}} = i_{m}\right) \\ &\qquad= \sum\limits_{p=n+m-k}^{\infty} \sum\limits_{J} {\textsf{P}}\left( X_{i_{1}} \in B_{1}, \ldots, X_{i_{m}} \in B_{m}, {j_{1}^{y}} = i_{1}, \ldots, {j_{m}^{y}} = i_{m}\right) \end{aligned} $$
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{p=n+m-k}^{\infty}\sum\limits_{J} {\textsf{P}}(X_{i_{1}}\! \in\! B_{1}, \ldots, X_{i_{m}} \!\in\! B_{m}, \{Y_{i} = y, i \!\in \!J\}, \{Y_{q} \!\ne\! y, q \!\in \!\{1, \ldots, p\} \setminus J\}) \\ &=& \sum\limits_{p=n+m-k}^{\infty} \sum\limits_{J} \prod\limits_{i = 1}^{m} {\textsf{P}}(X \in B_{i}, Y = y) \prod\limits_{q \in \{1, \ldots, p\} \setminus J} {\textsf{P}}(Y \ne y) \\ &=& \prod\limits_{i = 1}^{m} {\textsf{P}}(X \in B_{i}, Y = y) \sum\limits_{p=n+m-k}^{\infty} \sum\limits_{J} {\textsf{P}}(Y \ne y)^{p-m}, \end{array} $$
[Fig. 5: Left: c = 0.1. Right: c = 10]

where, for \(p \geq n + m - k\), \({\sum }_{J}\) denotes the sum over all J having the form

$$ J(p,n,m,k)=\{\{i_{1}, \ldots, i_{k}, i_{k+1}, \ldots, i_{m} \} \colon 1 \leq i_{1} < {\ldots} < i_{k} \leq n < i_{k+1} < {\ldots} <i_{m} = p\}. $$

For k = 0, the sum \({\sum }_{J}\) is taken over \(i_{1},\ldots,i_{m}\) such that \(n < i_{1} < {\ldots} < i_{m}\). If \(J = \{i_{1},\ldots,i_{m}\}\subset \{1,\ldots,p\}\), where \(i_{m} = p\), then \( \{{j_{1}^{y}} = i_{1}, \ldots , {j_{m}^{y}} = p\} = \{Y_{i} = y, i \in J\} \cap \{Y_{q} \ne y, q \in \{1,\ldots,p\}\setminus J\}\). For p, n, m and k under consideration the number of all collections J(p, n, m, k) is equal to \(\binom {n}{k}\binom {p-n-1}{m-k-1}\). Consequently,

$$ \begin{aligned} &{\textsf{P}}(X_{{j_{1}^{y}}} \in B_{1}, \ldots, X_{{j_{m}^{y}}} \in B_{m}, N_{y,n} = k)\\ &\qquad= \binom{n}{k} \prod\limits_{i = 1}^{m} {\textsf{P}}(X \in B_{i}, Y = y) \sum\limits_{p=n+m-k}^{\infty} \binom{p-n-1}{m-k-1} {\textsf{P}}(Y \ne y)^{p-m}, \end{aligned} $$

where we took into account that (X1, Y1), (X2, Y2),… are i.i.d. random vectors having the same distribution as (X, Y). Set \(l = p - (n + m - k)\). Then

$$ \begin{aligned} &\sum\limits_{p=n+m-k}^{\infty} \binom{p-n-1}{m-k-1} {\textsf{P}}(Y \ne y)^{p-m}= \sum\limits_{l=0}^{\infty}\binom{l+m-k-1}{m-k-1}{\textsf{P}}(Y \ne y)^{l+n-k}\\ &\qquad= {\textsf{P}}(Y \ne y)^{n-k}{\textsf{P}}(Y=y)^{-(m-k)}\sum\limits_{l=0}^{\infty}\binom{l+m-k-1}{l}{\textsf{P}}(Y \ne y)^{l}{\textsf{P}}(Y=y)^{m-k}\\ &\qquad={\textsf{P}}(Y \ne y)^{n-k}{\textsf{P}}(Y=y)^{k} {\textsf{P}}(Y=y)^{-m}, \end{aligned} $$

since \({\sum }_{l=0}^{\infty }\binom {l+m-k-1}{l}{\textsf {P}}(Y \ne y)^{l}{\textsf {P}}(Y=y)^{m-k}=1\) by the normalization of the negative binomial distribution. Thus

$$ {\textsf{P}}(X_{{j_{1}^{y}}} \in B_{1}, \ldots, X_{{j_{m}^{y}}} \in B_{m}, N_{y,n} = k) = \binom{n}{k}{\textsf{P}}(Y = y)^{k}{\textsf{P}}(Y \!\ne\! y)^{n-k} \prod\limits_{i = 1}^{m} {\textsf{P}}(X \in B_{i}| Y = y). $$

The latter relation yields all the Lemma statements when m > k.
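
The negative binomial normalization used above is easy to confirm numerically; the parameters below (with m > k) are illustrative:

```python
# Check of sum_{l >= 0} C(l + m - k - 1, l) * q**l * p**(m - k) = 1, where p = P(Y = y), q = 1 - p.
from math import comb

p, m, k = 0.3, 7, 4            # assumed illustrative values, m > k
q = 1.0 - p
total = sum(comb(l + m - k - 1, l) * q**l * p**(m - k) for l in range(2000))
print(total)                   # close to 1 up to the truncation of the series
```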

Let now \(1 \leq m \leq k\), where k ∈ {0,…,n}. The only possible case is k > 0, and hence

$$ \begin{array}{@{}rcl@{}} {\textsf{P}}&&{}\left( X_{{j_{1}^{y}}} \in B_{1}, \ldots, X_{{j_{m}^{y}}} \in B_{m}, N_{y,n} = k\right)\\ &=& \sum\limits_{\substack{ I = \{i_{1}, \ldots, i_{k} \} \colon \\ 1 \leq i_{1} < {\ldots} < i_{k} \leq n }}{\textsf{P}}\left( X_{i_{1}}\in B_{1},\ldots, X_{i_{m}}\in B_{m}, {j_{1}^{y}}=i_{1},\ldots,{j_{m}^{y}}=i_{m},\ldots,{j_{k}^{y}}=i_{k}\right)\\ &=& \binom{n}{k}\prod\limits_{i=1}^{m} {\textsf{P}}(X\in B_{i},Y=y){\textsf{P}}(Y=y)^{k-m}{\textsf{P}}(Y\neq y)^{n-k}\\ &=& \binom{n}{k}{\textsf{P}}(Y=y)^{k} {\textsf{P}}(Y\neq y)^{n-k}\prod\limits_{i=1}^{m} {\textsf{P}}(X\in B_{i}|Y=y). \end{array} $$

This equality implies the validity of all the Lemma assertions in the considered case. □

Proof of Lemma 3

We can take c = 1 without loss of generality. Indeed, I(X, Y) = I(X, aY) for any a ≠ 0 since, for each one-to-one mapping ϕ: M → T, where card(M) = card(T) = m, one has I(X, ϕ(Y)) = I(X, Y). Condition (B) is valid in view of Corollary 1 of Bulinski and Kozhevin (2018). Consider (C). One has \(P_{X} \ll \mu\) and, consequently, for any \(x\in \mathbb {R}^{d}\) and y ∈ M,

$$ 0 \leq P_{X, Y}(A_{x, y, 0}) = {\textsf{P}}(X = x, Y = y) \leq {\textsf{P}}(X = x) = 0. $$

Moreover,

$$ \begin{array}{@{}rcl@{}} \textsf{E} \left| \log \frac{f_{X, Y}(X, Y)}{f_{X}(X) f_{Y}(Y)}\right| &=&\sum\limits_{y \in M} \textsf{E} \left( \left| \log \frac{f_{X, Y}(X, Y)}{f_{X}(X) f_{Y}(Y)}\right| \Big| Y = y \right) {\textsf{P}}(Y = y) \\ &\leq& \textsf{E} \left| \log f_{Y}(Y) \right| + \textsf{E} \left| \log f_{Y|X}(Y|X) \right| = H(Y) + H(Y|X) < \infty. \end{array} $$

Here we employed that \(0 \leq f_{Y|X}(y|x) \leq 1\) for any y ∈ M and μ-almost all \(x \in \mathbb {R}^{d}\), as well as \(0 \leq f_{Y}(y) = {\textsf{P}}(Y = y) \leq 1\). Thus, (C) is valid.

Corollary 2.7 in Bulinski and Dimitrov (2019b) implies that, for each ν > 2 and ε > 0, one has \(L_{f}(\nu )<\infty \) and \(G_{f}(\varepsilon )<\infty \). It remains to show that \(T_{g_{y}}(\varepsilon ) < \infty \) for y ∈ M.

According to formula (4.17) of Bulinski and Dimitrov (2019b)

$$ f_{X}(z) \geq f_{X}(x) e^{-\frac{\|z-x\|^{2}}{2 \lambda_{min} } + \left( {\Sigma}^{-1}(\nu - x), z - x \right) }, $$

where λmin > 0 is the minimal eigenvalue of Σ. Immediately one obtains

$$ g_{1}(z) \geq g_{1}(x) \frac{1 + e^{-(w,x) - b}}{1 + e^{-(w,z) - b}} e^{-\frac{\|z-x\|^{2}}{2 \lambda_{min} } + \left( {\Sigma}^{-1}(\nu - x), z - x \right) }, $$
$$ g_{0}(z) \geq g_{0}(x) \frac{1 + e^{(w,x) + b}}{1 + e^{(w,z) + b}} e^{-\frac{\|z-x\|^{2}}{2 \lambda_{min} } + \left( {\Sigma}^{-1}(\nu - x), z - x \right) }. $$

At first we find a lower bound for \(\frac {1 + e^{-(w,x)-b}}{1 + e^{-(w,z)-b}}\) when \(\|x - z\| \leq r\). Set u = −(w, x) − b. Then

$$ \frac{1 + e^{-(w,x)-b}}{1 + e^{-(w,z)-b}} \geq \frac{1+ e^{-\|w\|r+u}}{1+e^{u}} = 1 -\frac{e^{u}}{1+e^{u}}(1- e^{-\|w\|r})\geq e^{-\|w\|r}. $$
(44)
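
The elementary bound in Eq. 44 can be checked numerically on a grid (the grid below is an arbitrary choice):

```python
# Grid check of (1 + exp(u - t)) / (1 + exp(u)) >= exp(-t) for real u and t = ||w|| r >= 0.
import numpy as np

t = np.linspace(0.0, 5.0, 101)            # values of ||w|| r
u = np.linspace(-20.0, 20.0, 401)
T, U = np.meshgrid(t, u)
ratio = (1.0 + np.exp(U - T)) / (1.0 + np.exp(U))
print(bool(np.all(ratio >= np.exp(-T) - 1e-12)))   # expected: True
```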

In a similar way \(\frac {1 + e^{(w,x)+b}}{1 + e^{(w,z)+b}} \geq e^{-\|w\|r}\) whenever \(\|x - z\| \leq r\). Hence, for y ∈ M, R > 0, r ∈ (0,R) and \(x,z\in \mathbb {R}^{d}\) such that \(\|x - z\| \leq r\), one has

$$ \begin{array}{@{}rcl@{}} {\int}_{B(x, r)} f_{X,Y}(z, y) dz &\geq& f_{X, Y}(x, y) e^{- \|w\| r-\frac{r^{2}}{2 \lambda_{min} }} {\int}_{B(0, r)} e^{({\Sigma}^{-1}(\nu - x), u ) } du \\ &\geq& f_{X, Y}(x, y) e^{- \|w\| r-\frac{r^{2}}{2 \lambda_{min} }} {\int}_{B(0, r)} (1 + ({\Sigma}^{-1}(\nu - x), u )) du \\ &=& f_{X, Y}(x, y) e^{- \|w\| r-\frac{r^{2}}{2 \lambda_{min} }} r^{d} V_{d} \geq f_{X, Y}(x, y) e^{- \|w\| R-\frac{R^{2}}{2 \lambda_{min} }} r^{d} V_{d}. \end{array} $$

Consequently, for \(C = e^{- \|w\| R -\frac {R^{2}}{2 \lambda _{min} }}\), y ∈ M, \(x \in \mathbb {R}^{d}\) and 0 < r < R,

$$ m_{f_{X, Y}}(x, r) \geq C f_{X, Y}(x, y). $$
(45)

For each ε ∈ (0, 1) and y ∈ M,

$$ {\int}_{\mathbb{R}^{d}} f_{X, Y}^{1- \varepsilon}(x, y) dx \leq {\int}_{\mathbb{R}^{d}} f^{1- \varepsilon}_{X}(x) dx < \infty, $$

hence, \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y) dx < \infty . \) Due to Lemma 2.5 of Bulinski and Dimitrov (2019b), we can claim that \(T_{g_{y}}(\varepsilon ):=T_{g_{y}}(\varepsilon ,\varepsilon )<\infty \) for all ε small enough. The proof of the Lemma is complete. □

Proof of Lemma 4

Similarly to the proof of Lemma 3, we infer that (C) is valid. Now we turn to (A). Obviously, \(f(x) \leq \frac {1}{(\pi \gamma )^{d}}\) for \(x \in \mathbb {R}^{d}\), therefore \(Q_{f}(\varepsilon ) < \infty \) for ε ∈ (0, 1). Let us prove that \(L_{f}(\nu ) < \infty \) for some ν > 2. We employ a linear change of variables in the integral representing \(L_{f}(\nu)\) and note that \(\left |\log \| y - x \| -\log \gamma \right |^{\nu } \leq 2^{\nu -1}(\left |\log \| y - x \|\right |^{\nu } + |\log \gamma |^{\nu })\). Then we can claim that \(L_{f}(\nu )<\infty \) if

$$ {\int}_{\mathbb{R}^{2d}} \left|\log \| y - x \|\right|^{\nu} \prod\limits_{i=1}^{d} \frac{1}{\left( 1 + {x_{i}^{2}} \right)} \frac{1}{\left( 1 + {y_{i}^{2}} \right)} d x dy < \infty. $$

Set \(u = \frac {y - x}{\sqrt {2}}, v = \frac {x + y}{\sqrt {2}}\). Then, for u = (u1,…,ud) and v = (v1,…,vd), we can study the convergence of the following integral:

$$ \begin{aligned} &{\int}_{\mathbb{R}^{2d}} \left| \log (\sqrt{2} \| u \|) \right|^{\nu} \prod\limits_{i=1}^{d} \frac{1}{\left( 1 + \frac{(u_{i} - v_{i})^{2}}{2} \right)} \frac{1}{\left( 1 + \frac{(u_{i} + v_{i})^{2}}{2} \right)} du dv\\ &\qquad= (\sqrt{2} \pi)^{d}{\int}_{\mathbb{R}^{d}}|\log (\sqrt{2}\|u\|)|^{\nu}\prod\limits_{i=1}^{d} \frac{1}{{u_{i}^{2}} +2}du \end{aligned} $$

since, for \(z \in \mathbb {R}\), \( I(z) = {\int \limits }_{\mathbb {R}} \frac {1}{\left (1 + \frac {(z - s)^{2}}{2} \right )} \frac {1}{\left (1 + \frac {(z + s)^{2}}{2} \right )} ds = \frac {\sqrt {2} \pi }{z^{2} + 2}. \) Introduce \(v_{i}={u_{i}^{2}}\), i = 1,…,d. Thus, it is enough to show that \(J(d)={\int \limits }_{(0,\infty )^{d}}h_{d}(v)dv<\infty \), where

$$ h_{d}(v):=|\log (v_{1}+{\ldots} +v_{d})|^{\nu}\prod\limits_{i=1}^{d} \frac{1}{v_{i}^{1/2}(v_{i} +2)}, v\in (0,\infty)^{d}. $$
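
The closed form of I(z) used above can be confirmed by numerical integration; the values of z below are illustrative:

```python
# Check of I(z) = int_R ds / ((1 + (z - s)**2 / 2) * (1 + (z + s)**2 / 2)) = sqrt(2) * pi / (z**2 + 2).
import numpy as np
from scipy.integrate import quad

for z in (0.0, 0.7, 3.0):
    val, _ = quad(lambda s: 1.0 / ((1 + (z - s) ** 2 / 2) * (1 + (z + s) ** 2 / 2)),
                  -np.inf, np.inf)
    print(z, round(val, 6), round(np.sqrt(2) * np.pi / (z ** 2 + 2), 6))
```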

It is easily seen that \(J(1)<\infty \). Consider d ≥ 2. One has J(d) = J1(d) + J2(d), where J1(d) and J2(d) are integrals of hd(v) taken over \(B_{1}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} <1\}\) and \(B_{2}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} \geq 1\}\), respectively. We write

$$ \begin{array}{@{}rcl@{}} J_{1}(d) &=& {\int}_{B_{1} \cap \{v_{1}<v_{2}+{\ldots} +v_{d}\}} h_{d}(v)dv + {\int}_{B_{1} \cap \{v_{1}\geq v_{2}+{\ldots} +v_{d}\}} h_{d}(v)dv = J_{1,1}(d)+J_{1,2}(d), \\ J_{1,1}(d) &\leq& {\int}_{v_{1}>0} \frac{|\log v_{1}|^{\nu}}{\sqrt{v_{1}}(v_{1}+2)}dv_{1} {\int}_{v_{2}>0,\ldots,v_{d}>0} \prod\limits_{j=2}^{d}\frac{1}{\sqrt{v_{j}}(v_{j}+2)} dv_{2} {\ldots} dv_{d} <\infty, \\ J_{1,2}(d) &\leq& {\int}_{v_{1}>0}\frac{1}{\sqrt{v_{1}}(v_{1}+2)} dv_{1}{\int}_{v_{2}>0,\ldots,v_{d}>0}\frac{|\log (v_{2}+ {\ldots} +v_{d})|^{\nu}} {{\prod}_{j=2}^{d}\sqrt{v_{j}}(v_{j}+2)} dv_{2}{\ldots} dv_{d} <\infty, \end{array} $$

where we use mathematical induction on d. In a similar way one can verify that \(J_{2}(d)<\infty \). The proof of the finiteness of J(d), for each \(d\in \mathbb {N}\) and ν > 2, is complete, so \(L_{f}(\nu ) < \infty \) for any ν > 2. Now we show that, for y ∈ M, \(x \in \mathbb {R}^{d}\), R > 0 and 0 ≤ r ≤ R, inequality (2) is valid, where C > 0 does not depend on x, y and r.

Due to Eq. 44, for each y ∈ M and \(x,z\in \mathbb {R}^{d}\) such that \(\|x - z\| \leq r\) (r ≥ 0),

$$ \frac{f_{X, Y}(z, y)}{f_{X, Y}(x, y)} \geq e^{- \|w\| r} \prod\limits_{i=1}^{d}\frac{1 + \left( \frac{x_{i}-\nu}{\gamma}\right)^{2}}{1 + \left( \frac{z_{i}-\nu}{\gamma}\right)^{2}}. $$
(46)

For \(d\in \mathbb {N}\) and \(x,z\in \mathbb {R}^{d}\), set \( F_{d}(x,z):= {\prod }_{i=1}^{d}\frac {1 + {x_{i}^{2}}}{1 + {z_{i}^{2}}}, x=(x_{1},\ldots ,x_{d}), z=(z_{1},\ldots ,z_{d}). \) Now we will consider \(x,z\in \mathbb {R}^{d}\) such that \(\|x - z\| \leq \gamma r\). Take R < 1/γ. One has

$$ F_{d}(x,z)\geq \prod\limits_{i=1}^{d}\frac{1}{1 + \frac{(\gamma r)^{2}}{1 + {x_{i}^{2}}} + \frac{2 x_{i}}{1 + {x_{i}^{2}}}(z_{i} - x_{i})}. $$

Evidently, \(\frac {1}{s+1}\geq 1-s\) for s > − 1. Thus we get \( F_{d}(x,z)\geq {\prod }_{i=1}^{d} (a_{i} - b_{i}(z_{i}-x_{i})), \) where \(a_{i} = 1-\frac {(\gamma r)^{2}}{1+{x_{i}^{2}}} \geq 1 - (\gamma r)^{2} \geq 1-(\gamma R)^{2}\), \(b_{i} = \frac {2 x_{i}}{1 + {x_{i}^{2}}}\), i = 1,…,d. Using induction one can prove that, for any r ≥ 0, \(A_{i},B_{i}\in \mathbb {R}\), i = 1,…,d,

$$ {\int}_{B(0,r)}\prod\limits_{i=1}^{d}(A_{i}-B_{i}u_{i})du = V_{d} r^{d} \prod\limits_{i=1}^{d} A_{i}. $$
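
This identity holds because every term containing at least one factor u_i integrates to zero over the ball by symmetry; a Monte Carlo check with arbitrary d, r, A_i, B_i (the values below are assumptions) reads:

```python
# Monte Carlo check of int_{B(0,r)} prod_i (A_i - B_i * u_i) du = V_d * r**d * prod_i A_i.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
d, r = 3, 1.5
A = np.array([2.0, -1.0, 0.5])
B = np.array([0.3, 1.2, -0.7])
u = rng.uniform(-r, r, size=(2_000_000, d))                  # sample the enclosing cube
inside = (u ** 2).sum(axis=1) <= r ** 2
mc = (2 * r) ** d * np.mean(np.prod(A - B * u, axis=1) * inside)
v_d = np.exp((d / 2) * np.log(np.pi) - gammaln(d / 2 + 1))   # volume of the unit ball
print(mc, v_d * r ** d * A.prod())                           # agree up to Monte Carlo error
```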

Return to Eq. 3. If \(\|x - z\| \leq r \leq R\), where 0 < R < 1/γ, then

$$ {\int}_{B(x, r)} f(z, y) dz \geq f(x, y) e^{-\|w\|r}V_{d} r^{d}\left( 1-(\gamma R)^{2}\right)^{d}. $$

Hence Eq. 2 is valid with \( C = \left (1-(\gamma R)^{2}\right )^{d} e^{-\|w\|R}\). Therefore, for each ε ∈ (0, 1) and y ∈ M,

$$ {\int}_{\mathbb{R}^{d}} f_{X, Y}^{1- \varepsilon}(x, y) dx \leq {\int}_{\mathbb{R}^{d}} f_{X}^{1- \varepsilon}(x) dx < \infty, $$

and \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y) dx < \infty \).

It remains to verify condition (B). The function \( g_{1}(x) = \frac {1}{1 + e^{-(w, x) - b}} f(x) \) is strictly positive. Moreover,

$$ \begin{array}{@{}rcl@{}} \nabla g_{1}(x) &=& \frac{f(x)}{(1 + e^{-(w, x) - b})^{2}} e^{-(w,x)-b} w + \frac{1}{1 + e^{-(w, x) - b}} \nabla f(x), \\ \nabla f(x) &=& - \frac{2}{\gamma^{2}} f(x) \left( \frac{x_{1} - \nu}{\left( 1 + \left( \frac{x_{1} - \nu}{\gamma}\right)^{2} \right)}, \ldots, \frac{x_{d} - \nu}{\left( 1 + \left( \frac{x_{d} - \nu}{\gamma}\right)^{2} \right)} \right), \\ \|\nabla g_{1}(x)\| &\leq& \frac{1}{4} |f(x)| \|w\| + \frac{1}{1 + e^{-(w, x) - b}} \|\nabla f(x)\| \leq \frac{1}{4(\pi \gamma)^{d}} \|w\| + \| \nabla f(x) \|. \end{array} $$

The function \(s(u) = \frac {u}{(1 + u^{2})^{2}}\) is continuous and \(\lim _{u \to \infty } s(u) = \lim _{u \to -\infty } s(u) = 0\), therefore \( \max \limits _{u \in \mathbb {R}} |(u - \nu )|\left (1 + ((u - \nu )/\gamma )^{2}\right )^{-1} < \infty . \) Thus \(\max \limits _{x \in \mathbb {R}^{d} }\|\nabla g_{1}(x)\| < \infty \) and g1(x) satisfies the Lipschitz condition with some constant C0 > 0 for each \(x \in \mathbb {R}^{d}\). According to Remark 1 in Bulinski and Kozhevin (2018) we conclude that g1 is C0-constricted. The same reasoning is valid for g0.

For i.i.d. random variables X1,…,Xd, the Minkowski inequality yields

$$ \begin{array}{@{}rcl@{}} \left( \textsf{E} |\log f(X)|^{2+\varepsilon} \right)^{1/(2+\varepsilon)} &\leq& \sum\limits_{i=1}^{d} \left( \textsf{E} \left| \log f_{X_{i}}(X_{i}) \right|^{2+\varepsilon} \right)^{1/(2+\varepsilon)}\\ &=& d \left( \textsf{E} \left| \log f_{X_{1}}(X_{1}) \right|^{2+\varepsilon} \right)^{1/(2+\varepsilon)}\\&=&d \left( {\int}_{\mathbb{R}} \frac{1}{\pi \left( 1 + x^{2} \right)} \left|\log \pi \gamma \left( 1 + x^{2} \right) \right|^{2+\varepsilon} dx\right)^{1/(2+\varepsilon)} < \infty. \end{array} $$

Thus (B) is satisfied. The proof of the Lemma is complete. □
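
For concreteness, the last integral can also be evaluated numerically; the values of γ and ε below are arbitrary assumptions:

```python
# Numerical evaluation of int_R |log(pi * g * (1 + x**2))|**(2 + eps) / (pi * (1 + x**2)) dx.
import numpy as np
from scipy.integrate import quad

g, eps = 1.0, 0.5
val, err = quad(lambda x: abs(np.log(np.pi * g * (1 + x ** 2))) ** (2 + eps)
                / (np.pi * (1 + x ** 2)), -np.inf, np.inf)
print(val, err)   # a finite value with a small error estimate
```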

Cite this article

Bulinski, A., Kozhevin, A. Statistical Estimation of Mutual Information for Mixed Model. Methodol Comput Appl Probab 23, 123–142 (2021). https://doi.org/10.1007/s11009-020-09802-0
