Neural Networks (IF 7.8), Pub Date: 2020-07-03, DOI: 10.1016/j.neunet.2020.06.022
Aluizio F R Araújo, Victor O Antonino, Karina L Ponce-Guevara
A surge in the availability of data from multiple sources and modalities is correlated with advances in obtaining, compressing, storing, transferring, and processing large amounts of complex high-dimensional data. Clustering becomes harder as data dimensionality grows, because high dimensionality reduces the discriminative power of distance metrics. Subspace clustering aims to group data drawn from a union of subspaces. Many state-of-the-art approaches exist, and we divide them into families according to the method used for clustering. We introduce a soft subspace clustering algorithm, a Self-Organizing Map (SOM) with a time-varying structure, that clusters data without prior knowledge of the number of categories or of the neural network topology, both of which are determined during training. The model also assigns relevances (weights) to the different dimensions, capturing from the learning process the influence of each dimension on uncovering clusters. We validate the model on a number of real-world datasets. The algorithm is competitive across a diverse range of contexts, among them data mining, gene expression, multi-view, computer vision, and text clustering problems, which include high-dimensional data. Extensive experiments suggest that our method very often outperforms state-of-the-art approaches in all types of problems considered.
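The idea of per-dimension relevances can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the update rules, so the function names (`relevance_from_spread`, `weighted_distance`, `winner`), the min-max rule that turns spread into relevance, and the toy vectors are all assumptions made for illustration. The sketch only shows why weighting dimensions can change which prototype wins when some dimensions carry no cluster information.

```python
import numpy as np

def relevance_from_spread(spread):
    """Map per-dimension spread to a relevance in [0, 1].

    Hypothetical rule (not from the paper): dimensions in which a
    cluster is tight get relevance near 1, diffuse ones near 0.
    """
    spread = np.asarray(spread, dtype=float)
    lo, hi = spread.min(), spread.max()
    if hi == lo:  # every dimension is equally informative
        return np.ones_like(spread)
    return 1.0 - (spread - lo) / (hi - lo)

def weighted_distance(x, prototype, relevance):
    """Euclidean distance with each coordinate difference scaled by its relevance."""
    diff = np.asarray(relevance) * (np.asarray(x) - np.asarray(prototype))
    return float(np.sqrt(np.sum(diff ** 2)))

def winner(x, prototypes, relevances):
    """Index of the prototype closest to x under its own relevance vector."""
    distances = [weighted_distance(x, p, r) for p, r in zip(prototypes, relevances)]
    return int(np.argmin(distances))

# Toy setup (assumed): only the first two dimensions separate the clusters;
# the last two behave like noise.
x = np.array([0.1, 0.0, 9.0, 1.0])
prototypes = [np.array([0.0, 0.0, 1.0, 9.0]),
              np.array([1.0, 1.0, 9.0, 1.0])]

uniform = [np.ones(4), np.ones(4)]                        # plain Euclidean distance
learned = [relevance_from_spread([0.05, 0.05, 4.0, 4.0])] * 2

print(winner(x, prototypes, uniform))   # noise dimensions dominate the choice
print(winner(x, prototypes, learned))   # only the informative dimensions count
```

With uniform weights the sample is assigned to the second prototype, because the two noise dimensions dominate the distance. With the relevance vector it is assigned to the first prototype, which it matches in the informative dimensions. In a soft subspace method each prototype would learn its own relevance vector during training; here a single fixed vector is reused for brevity.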
Title: Self-organizing subspace clustering for high-dimensional and multi-view data.