Evaluating reliability of tree-patterns in extreme-K categorical samples problems
Journal of Statistical Computation and Simulation (IF 1.1), Pub Date: 2021-07-21, DOI: 10.1080/00949655.2021.1951266
Elizabeth Chou, Yin-Chen Hsieh, Sabrina Enriquez, Fushing Hsieh

Exploratory Data Analysis (EDA) approaches are adopted to address the difficult extreme-K categorical sample problem. Because the observed data are categorical, all comparisons among populations are performed by comparing their distributions, each represented as a histogram with symbolic bins. A distance measure is designed to evaluate the discrepancy between two symbol-based histograms, so that Hierarchical Clustering (HC) algorithms can be applied. The resulting binary HC-tree then serves as the basis for our EDA task of discovering tree-patterns of interest. Since each population-leaf's location within the geometry of a binary HC-tree is expressed as a binary code sequence, a binary code segment characterizes the tree-patterns shared by all members that carry it. We then generate a large ensemble of mimicries of the observed dataset based on multinomial distributions and construct a corresponding ensemble of binary HC-trees. For each tree-pattern identified from the observed dataset, we evaluate its reliability and uncertainty through two histograms.
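
The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance measure, so total variation distance is used as a stand-in; average linkage is an assumed choice; and the toy counts, the function names (`tv_distance`, `hc_tree`, `binary_codes`, `mimic`, `is_branch`) and the single "branch recovery fraction" are all hypothetical simplifications of the two-histogram evaluation mentioned above.

```python
import random
from itertools import combinations

def proportions(counts):
    """Turn symbol -> count into symbol -> proportion."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two symbolic histograms (assumed stand-in)."""
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))

def leaves(tree):
    """Population labels under a nested-tuple binary tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def hc_tree(hists):
    """Average-linkage agglomerative clustering; returns a nested-tuple binary tree."""
    def link(a, b):
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(tv_distance(hists[x], hists[y]) for x, y in pairs) / len(pairs)
    clusters = list(hists)  # every population starts as its own leaf
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: link(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))
    return clusters[0]

def binary_codes(tree, prefix=""):
    """Encode each leaf's location: 0 for the first child, 1 for the second."""
    if isinstance(tree, str):
        return {tree: prefix}
    codes = binary_codes(tree[0], prefix + "0")
    codes.update(binary_codes(tree[1], prefix + "1"))
    return codes

def mimic(counts, rng):
    """Multinomial mimicry: resample a population with its observed proportions."""
    symbols = list(counts)
    draws = rng.choices(symbols, weights=[counts[s] for s in symbols],
                        k=sum(counts.values()))
    return {s: draws.count(s) for s in symbols}

def is_branch(tree, group):
    """True if some subtree contains exactly the populations in `group`."""
    if set(leaves(tree)) == group:
        return True
    if isinstance(tree, str):
        return False
    return is_branch(tree[0], group) or is_branch(tree[1], group)

# Toy observed data (hypothetical): four populations over symbols A, B, C.
observed = {
    "P1": {"A": 80, "B": 15, "C": 5},
    "P2": {"A": 75, "B": 20, "C": 5},
    "P3": {"A": 10, "B": 20, "C": 70},
    "P4": {"A": 5, "B": 30, "C": 65},
}

tree = hc_tree({name: proportions(c) for name, c in observed.items()})
codes = binary_codes(tree)

# Tree-pattern of interest: the populations sharing the code segment "0".
pattern = {name for name, code in codes.items() if code.startswith("0")}

# Ensemble of mimicked trees; count how often the pattern reappears as a branch.
rng = random.Random(0)
n_mimicries = 200
hits = 0
for _ in range(n_mimicries):
    fake = {name: proportions(mimic(c, rng)) for name, c in observed.items()}
    hits += is_branch(hc_tree(fake), pattern)
reliability = hits / n_mimicries

print(codes, pattern, reliability)
```

Here a shared leading code segment identifies a branch of the observed tree, and the fraction of mimicked trees that reproduce that branch gives a crude reliability score. A fuller treatment would replace this single number with the distributional summaries the abstract refers to.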




Updated: 2021-07-21