Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples
arXiv - CS - Information Theory Pub Date : 2021-02-25 , DOI: arxiv-2102.12675
Hoshin V Gupta, Mohammed Reza Ehsani, Tirthankar Roy, Maria A Sans-Fuentes, Uwe Ehret, Ali Behrangi

We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) method. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC requires optimal tuning of a bin-width hyper-parameter whose value varies with sample size and shape of the pdf, QS requires specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantile-spacings is a fixed fraction of the sample size (empirically determined to be ~0.25-0.35), and that this value is relatively insensitive to distributional form or sample size, providing a clear advantage over BC since hyperparameter tuning is not required. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is relatively low even for very small sample sizes. We speculate that estimating quantile locations, rather than bin-probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
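The quantile-spacing idea described above can be sketched in a few lines. The following is an illustrative reconstruction based only on the abstract, not the authors' published algorithm: the function names, the choice to average quantile locations over bootstrap resamples before taking logs (which suppresses the Jensen bias of raw empirical spacings), and the default spacing fraction of 0.3 (from the reported ~0.25-0.35 range) are all assumptions.

```python
import numpy as np

def qs_entropy(x, frac=0.3, n_boot=50, seed=0):
    """Sketch of a quantile-spacing (QS) entropy estimate (illustrative,
    not the authors' exact algorithm).

    frac   : ratio of quantile spacings to sample size; the abstract
             reports ~0.25-0.35 as near-optimal (assumption: 0.3).
    n_boot : bootstrap resamples over which quantile locations are
             averaged before logs are taken (assumption, to tame the
             noise in individual empirical spacings).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = max(1, int(frac * x.size))           # number of equal-mass intervals
    probs = np.linspace(0.0, 1.0, n + 1)     # quantile levels incl. min/max
    # Average each quantile's location across bootstrap resamples.
    q = np.mean(
        [np.quantile(rng.choice(x, size=x.size, replace=True), probs)
         for _ in range(n_boot)],
        axis=0,
    )
    dq = np.diff(q)
    dq = dq[dq > 0]                          # guard against tied spacings
    # Piecewise-uniform pdf: p_i = (1/n) / dq_i on each interval, so
    # H = -sum_i (1/n) * ln(p_i) = mean_i ln(n * dq_i)
    return float(np.mean(np.log(n * dq)))
```

As a sanity check, for a standard Gaussian the true differential entropy is 0.5*ln(2*pi*e) ≈ 1.419 nats, and for Uniform(0, 1) it is 0; with a few tens of thousands of samples the sketch lands close to both.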

Updated: 2021-02-26