Co-clustering algorithms for distributional data with automated variable weighting, Information Sciences



Information Sciences (IF 8.1), Pub Date: 2020-11-28, DOI: 10.1016/j.ins.2020.11.018
Francisco de A.T. De Carvalho, Antonio Balzanella, Antonio Irpino, Rosanna Verde

This paper is concerned with the co-clustering of distribution-valued data, that is, the simultaneous partitioning of the rows and columns of an input data table whose elements are distributions (or histograms) representing aggregate data. The first proposed method extends the double k-means algorithm to distributional data. The $L_{2}$ Wasserstein distance, also known as Mallows' distance, is used to compare distributions. To account for the different relevance of the variables characterizing the clusters, four variants of adaptive distributional double k-means are proposed. In each variant, an additional step is introduced into the co-clustering procedure to compute the relevance weights associated with the variables. The four algorithms respectively provide: i) a single set of weights for the variables; ii) different sets of weights for the variables, one for each cluster (cluster-wise); iii) a double set of weights for the variables, according to the decomposition of the $L_{2}$ Wasserstein distance into two components; iv) different double sets of weights for the variables and distance components, one for each cluster (cluster-wise). Applications using simulated and real data demonstrate the effectiveness of the proposed algorithms and show how the relevance weights contribute to the co-clustering procedure depending on the structure of the data.
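
As a rough illustration of the distance mentioned above, the following Python sketch approximates the squared $L_{2}$ Wasserstein distance between two one-dimensional distributions from their empirical quantile functions, and splits it into two additive parts. The abstract does not say which two components the authors use; the split shown here (a squared difference of means plus a distance between mean-centered quantile functions) is one standard decomposition and is an assumption. The function names `quantiles`, `w2_squared`, and `w2_decomposed` are hypothetical, and the code is not the paper's implementation.

```python
import numpy as np


def quantiles(sample, n_levels=1000):
    """Empirical quantile function evaluated at the midpoints of a uniform grid on (0, 1)."""
    t = (np.arange(n_levels) + 0.5) / n_levels
    return np.quantile(np.asarray(sample, dtype=float), t)


def w2_squared(qf, qg):
    """Squared L2 Wasserstein distance, approximated as the mean squared
    difference of two quantile vectors evaluated on the same grid."""
    return float(np.mean((qf - qg) ** 2))


def w2_decomposed(qf, qg):
    """Split the squared distance into two additive parts (assumed decomposition):
    a location part (squared difference of means) and a part comparing the
    mean-centered quantile functions (dispersion and shape)."""
    mf, mg = qf.mean(), qg.mean()
    location = float((mf - mg) ** 2)
    centered = float(np.mean(((qf - mf) - (qg - mg)) ** 2))
    return location, centered


# Hypothetical usage: two samples that differ in both location and spread.
rng = np.random.default_rng(0)
qf = quantiles(rng.normal(0.0, 1.0, size=2000))
qg = quantiles(rng.normal(2.0, 3.0, size=2000))
total = w2_squared(qf, qg)
location, centered = w2_decomposed(qf, qg)
print(total, location + centered)  # the two values agree up to rounding
```

Under this assumed split, a weighting scheme of type iii) or iv) could attach one weight to the location part and another to the centered part of each variable, whereas schemes i) and ii) would weight the total distance per variable.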


Updated: 2020-12-14
