Stochastic Estimations of the Total Number of Classes for a Clustering having Extremely Large Samples to be Included in the Clustering Engine
Advanced Theory and Simulations (IF 2.9) Pub Date: 2021-03-26, DOI: 10.1002/adts.202000301
Keishu Utimula 1, Genki I. Prayogo 1, Kousuke Nakano 2, Kenta Hongo 3, 4, 5, Ryo Maezono 2

Numerous reports have elucidated the classification of large amounts of data using various clustering techniques. However, growth in data size limits the applicability of these methods. Here, it is investigated how to deal with an exploding number of possibilities that must be sorted into irreducible classes by a clustering technique whose input capacity cannot accommodate all of them. An example is atomic substitution in the supercell modeling of alloys, where the number of possibilities can reach the trillions, far too many to feed into a clustering engine. It is therefore not practically feasible to identify directly how many irreducible classes exist, even though several techniques are available to perform the clustering. To circumvent this capacity limitation, a stochastic framework is developed that provides the total number of irreducible classes (the order of classes) as a statistical estimate. The main conclusion is that the statistical variation of the number of classes found at each sampling trial can serve as a promising measure for estimating the total number of irreducible classes. Characteristics of this approach are also discussed by comparison with the conventional one based on Polya's theorem.
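
To make the setting concrete, the following is a minimal toy sketch and not the authors' estimator, which the abstract does not specify. It assumes a deliberately simplified model: a ring of `n` sites, each occupied by one of `k` species, with rotations as the only symmetry. Under that assumption, the exact number of irreducible classes follows from Burnside's lemma (the counting result underlying Polya's theorem), while `classes_in_sample` mimics a capacity-limited clustering run by classifying only `m` randomly drawn configurations. All function names here are hypothetical.

```python
import random
from math import gcd


def burnside_count(n, k=2):
    """Exact number of k-species configurations on an n-site ring, up to rotation."""
    # Burnside's lemma: a rotation by r sites fixes k**gcd(n, r) configurations.
    return sum(k ** gcd(n, r) for r in range(n)) // n


def canonical(config):
    """Class label: the lexicographically smallest rotation of the configuration."""
    n = len(config)
    return min(config[i:] + config[:i] for i in range(n))


def classes_in_sample(n, m, rng, k=2):
    """Draw m random configurations and count the distinct classes among them."""
    seen = set()
    for _ in range(m):
        config = tuple(rng.randrange(k) for _ in range(n))
        seen.add(canonical(config))
    return len(seen)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the trials are reproducible
    n = 12
    exact = burnside_count(n)
    for m in (100, 1000, 10000):
        # Repeat each sampling trial a few times to see the trial-to-trial spread.
        counts = [classes_in_sample(n, m, rng) for _ in range(5)]
        print(f"m={m}: classes per trial {counts}, exact total {exact}")
```

In this toy model the number of classes found in a trial can never exceed the exact total, and it fluctuates from trial to trial when `m` is small. The abstract's claim is that this kind of variation carries enough information to estimate the total; how that estimate is constructed is described in the paper itself and is not reproduced here.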

Updated: 2021-05-05