MDCGen: Multidimensional Dataset Generator for Clustering
Journal of Classification (IF 2), Pub Date: 2019-04-23, DOI: 10.1007/s00357-019-9312-3
Félix Iglesias, Tanja Zseby, Daniel Ferreira, Arthur Zimek

We present a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to the underlying distributions used to create multidimensional clusters. As a novelty, normal and non-normal distributions can be combined either to define values independently, feature by feature (i.e., multivariate distributions), or to establish overall intra-cluster distances. Being highly flexible, parameterizable, and randomizable, MDCGen also implements commonly sought features: (a) customization of cluster separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) importing arbitrary distributions for the value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers a wider range of potential datasets for more comprehensive testing of clustering algorithms.
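To make the feature-by-feature idea concrete, here is a minimal sketch in Python. It is not MDCGen and does not use its API: the function name `make_clusters`, its parameters, the choice of two features, and the specific distributions are all assumptions made for illustration. Each cluster draws one feature from a normal distribution and the other from a uniform (non-normal) distribution, and a few uniformly scattered points labeled `-1` stand in for outliers and noise.

```python
import random


def make_clusters(n_clusters=3, points_per_cluster=50, n_outliers=5, seed=0):
    """Hypothetical toy generator (not MDCGen): 2-D points with integer labels.

    Labels 0..n_clusters-1 mark cluster membership; -1 marks scattered points.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for k in range(n_clusters):
        # Centers are spaced along the diagonal so clusters stay separated.
        cx, cy = 10.0 * k, 10.0 * k
        for _ in range(points_per_cluster):
            x = rng.gauss(cx, 1.0)           # feature 0: normal distribution
            y = cy + rng.uniform(-1.5, 1.5)  # feature 1: uniform (non-normal)
            points.append((x, y))
            labels.append(k)
    # Scattered points drawn uniformly over a padded box around the centers.
    lo, hi = -5.0, 10.0 * (n_clusters - 1) + 5.0
    for _ in range(n_outliers):
        points.append((rng.uniform(lo, hi), rng.uniform(lo, hi)))
        labels.append(-1)
    return points, labels


points, labels = make_clusters()
print(len(points), labels.count(-1))  # prints "155 5"
```

Fixing the seed makes the dataset reproducible, which is useful when the same data must be fed to several clustering algorithms for comparison. Because the scattered points are drawn uniformly, some may fall inside a cluster; a real generator would need an explicit rule to keep outliers away from cluster regions.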

Updated: 2019-04-23