BROCCOLI: overlapping and outlier-robust biclustering through proximal stochastic gradient descent
Data Mining and Knowledge Discovery (IF 2.8), Pub Date: 2021-08-11, DOI: 10.1007/s10618-021-00787-z
Sibylle Hess 1 , Michiel Hochstenbach 1 , Gianvito Pio 2, 3 , Michelangelo Ceci 2, 3, 4

Matrix tri-factorization subject to binary constraints is a versatile and powerful framework for the simultaneous clustering of observations and features, also known as biclustering. Applications of biclustering include the clustering of high-dimensional data and explorative data mining, where the selection of the most important features is relevant. Unfortunately, because suitable methods for optimization subject to binary constraints are lacking, the biclustering framework is typically restricted to clusterings that partition the set of observations or features. As a result, overlap between clusters cannot be modelled, and every item, even an outlier, has to be assigned to exactly one cluster. In this paper we propose Broccoli, an optimization scheme for matrix factorization subject to binary constraints, which is based on the theoretically well-founded scheme of proximal stochastic gradient descent. With it, we do not impose any restrictions on the obtained clusters. Our experimental evaluation, performed on both synthetic and real-world data and against six competitor algorithms, shows reliable and competitive performance, even in the presence of a high amount of noise in the data. Moreover, a qualitative analysis of the identified clusters shows that Broccoli may provide meaningful and interpretable clustering structures.
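To make the idea of a relaxed, binary-constrained tri-factorization concrete, the following is a minimal sketch and not the Broccoli algorithm itself. It assumes a data matrix `D` is approximated as `Y @ C @ X.T`, relaxes the binary row and column cluster indicators `Y` and `X` to the interval [0, 1], and uses plain projection onto that interval as the proximal step. The paper's actual penalty function, its proximal operator, and its stochastic (mini-batch) updates are not reproduced here; the updates below are full-batch, and all function and variable names are hypothetical.

```python
import numpy as np


def prox_box(Z):
    # Proximal operator of the indicator function of [0, 1],
    # i.e. Euclidean projection onto the box (an assumed stand-in penalty).
    return np.clip(Z, 0.0, 1.0)


def loss(D, Y, C, X):
    # Half the squared Frobenius reconstruction error.
    return 0.5 * np.linalg.norm(D - Y @ C @ X.T) ** 2


def relaxed_tri_factorization(D, r, steps=200, seed=0):
    """Block-wise proximal gradient descent on 0.5 * ||D - Y C X^T||_F^2.

    Y (n x r) and X (m x r) are kept in [0, 1]; C (r x r) is unconstrained.
    Each block uses step size 1/L, where L is the Lipschitz constant of
    that block's gradient, so the loss does not increase.
    """
    rng = np.random.default_rng(seed)
    n, m = D.shape
    Y = rng.uniform(0.0, 1.0, (n, r))
    X = rng.uniform(0.0, 1.0, (m, r))
    C = rng.uniform(0.0, 1.0, (r, r))
    losses = [loss(D, Y, C, X)]
    eps = 1e-12  # guards against division by zero

    for _ in range(steps):
        # Update Y with C and X fixed.
        B = C @ X.T
        L = np.linalg.norm(B @ B.T, 2) + eps
        Y = prox_box(Y - ((Y @ B - D) @ B.T) / L)

        # Update X with Y and C fixed.
        A = Y @ C
        L = np.linalg.norm(A.T @ A, 2) + eps
        X = prox_box(X - ((A @ X.T - D).T @ A) / L)

        # Update C (no constraint, so a plain gradient step).
        L = np.linalg.norm(Y.T @ Y, 2) * np.linalg.norm(X.T @ X, 2) + eps
        C = C - (Y.T @ (Y @ C @ X.T - D) @ X) / L

        losses.append(loss(D, Y, C, X))
    return Y, C, X, losses


def binarize(Z, threshold=0.5):
    # Round the relaxed indicators to {0, 1}. A row may end up in several
    # clusters (overlap) or in none (a candidate outlier).
    return (Z >= threshold).astype(int)
```

Because each row of the rounded `Y` (and `X`) is thresholded independently, an observation or feature can belong to several clusters or to none, which is the kind of overlap and outlier handling the abstract describes. Simple rounding at 0.5 is only an illustrative choice here.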




Updated: 2021-08-11