SEND: A novel dissimilarity metric using ensemble properties of the feature space for clustering numerical data
Information Sciences ( IF 8.1 ) Pub Date : 2021-05-28 , DOI: 10.1016/j.ins.2021.05.059
Gaurav Mishra , Amit Kumar Kar , Amaresh Chandra Mishra , Sraban Kumar Mohanty , M.K. Panda

Clustering is an unsupervised learning technique that groups unlabeled data based on the proximity between data points, so the performance of a clustering technique depends mainly on the proximity measure it uses. Computing dissimilarity is challenging in high-dimensional and noisy datasets, and in datasets with imbalanced feature scales, all of which appear in various applications. To counter these challenges, we propose a new distance metric that computes the dissimilarity between data points by combining the ensemble properties, entropy, and weight information of the feature vectors. We consider the statistical information and entropy along each feature to compute the dissimilarity between points, and each feature is then assigned a weight based on its distribution information. The proposed Similarity measure based on Entropy for Numerical Datasets (SEND) is free from domain-specific parameters and makes no underlying assumptions about the distribution of the data. The proposed metric is applied to different types of clustering techniques to evaluate its performance. Experimental analyses on synthetic and real datasets demonstrate the efficacy of the proposed metric in terms of cluster quality, accuracy, execution time, robustness against noise, and ability to handle high-dimensional datasets.
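The abstract does not give the SEND formula, so the following is only a minimal sketch of the general idea it describes: a per-feature entropy, a weight derived from each feature's distribution, and a weighted distance on scale-normalized features. The function names, the histogram-based entropy estimate, the `bins` parameter, and the rule "lower normalized entropy gives a larger weight" are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def feature_entropy(column, bins=10):
    """Shannon entropy (nats) of one feature, estimated from a histogram.
    The histogram estimate and `bins` are assumptions of this sketch."""
    counts, _ = np.histogram(column, bins=bins)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum())

def entropy_weights(X, bins=10):
    """Hypothetical weighting rule: w_j is proportional to (1 - H_j / log(bins)).
    Weights are non-negative and sum to 1."""
    X = np.asarray(X, dtype=float)
    norm_entropy = np.array(
        [feature_entropy(X[:, j], bins) for j in range(X.shape[1])]
    ) / np.log(bins)
    raw = np.clip(1.0 - norm_entropy, 0.0, None)
    if raw.sum() == 0.0:
        # Every feature looks uniform: fall back to equal weights.
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return raw / raw.sum()

def weighted_dissimilarity(X, bins=10):
    """Pairwise weighted Euclidean distance on standardized features.
    Standardizing each feature removes differences in feature scale."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    Z = (X - X.mean(axis=0)) / std
    w = entropy_weights(X, bins)
    diff = Z[:, None, :] - Z[None, :, :]  # shape (n, n, d)
    return np.sqrt((w * diff ** 2).sum(axis=2))
```

A matrix produced this way could be handed to any clustering routine that accepts precomputed pairwise distances. The actual SEND definition should be taken from the paper itself.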




Updated: 2021-06-20