MR plot: A big data tool for distinguishing distributions
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2020-06-09, DOI: 10.1002/sam.11464
Omid M. Ardakani, Majid Asadi, Nader Ebrahimi, Ehsan S. Soofi

Big data enables reliable estimation of continuous probability density, cumulative distribution, survival, hazard rate, and mean residual functions (MRFs). We illustrate that a plot of the MRF provides the best resolution for distinguishing between distributions. At each point, the MRF gives the mean excess of the data beyond the threshold. The graph of the empirical MRF, called here the MR plot, provides an effective visualization tool. A variety of theoretical and data-driven examples illustrate that MR plots of big data preserve the shape of the MRF and that complex models require bigger data. The MRF is an optimal predictor of the excess of the random variable. With a suitable prior, the expected MRF gives the Bayes risk in the form of the entropy functional of the survival function, called here the survival entropy. We show that the survival entropy is dominated by the standard deviation (SD) and that equality between the two measures characterizes the exponential distribution. The empirical survival entropy provides a data concentration statistic that is strongly consistent, easy to compute, and less sensitive than the SD to heavy-tailed data. An application uses the New York City Taxi database, with millions of trip times, to illustrate the MR plot as a powerful tool for distinguishing distributions.
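
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `empirical_mrf` and `empirical_survival_entropy` are hypothetical, and the code assumes that the survival entropy takes the form -∫ S(x) log S(x) dx over non-negative data, a formula the abstract does not spell out. The empirical MRF at a threshold t is taken as the average of x - t over the observations with x > t, and the entropy is a plug-in estimate based on the empirical survival function.

```python
import math
import random
import statistics


def empirical_mrf(data, t):
    """Mean excess over threshold t, using only observations strictly above t."""
    excess = [x - t for x in data if x > t]
    if not excess:
        return float("nan")  # no observations beyond t
    return sum(excess) / len(excess)


def empirical_survival_entropy(data):
    """Plug-in estimate of -integral of S(x) * log S(x) dx (assumed definition).

    The empirical survival function is piecewise constant between sorted
    observations, so the integral reduces to a finite sum. Intervals where
    the survival function is 1 or 0 contribute nothing.
    """
    xs = sorted(data)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        s = (n - i) / n  # survival level between xs[i - 1] and xs[i]
        total -= s * math.log(s) * (xs[i] - xs[i - 1])
    return total


# Simulated exponential sample with mean 2 (rate 0.5); the seed is arbitrary.
rng = random.Random(0)
sample = [rng.expovariate(0.5) for _ in range(100_000)]

# Points of an MR plot: (threshold, mean excess beyond that threshold).
mr_curve = [(t, empirical_mrf(sample, t)) for t in (0.0, 1.0, 2.0, 4.0)]
entropy = empirical_survival_entropy(sample)
sd = statistics.pstdev(sample)

for t, m in mr_curve:
    print(f"t = {t:.1f}  mean excess = {m:.3f}")
print(f"survival entropy = {entropy:.3f}  SD = {sd:.3f}")
```

For an exponential sample the MR curve should be roughly flat at the mean, and the survival entropy should be close to the SD, in line with the characterization stated in the abstract. For other distributions the entropy would be expected to fall below the SD.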

Updated: 2020-06-09