Variations on the Clustering Algorithm BIRCH, Big Data Research



Big Data Research (IF 3.5), Pub Date: 2017-10-18, DOI: 10.1016/j.bdr.2017.09.002
Boris Lorbeer, Ana Kosareva, Bersant Deva, Dženan Softić, Peter Ruppel, Axel Küpper

Clustering algorithms have recently regained attention with the availability of large datasets and the rise of parallelized computing architectures. However, most clustering algorithms suffer from two drawbacks: they do not scale well with increasing dataset sizes, and they often require proper parametrization, which is usually difficult to provide. A very important example is the cluster count, a parameter that in many situations is next to impossible to assess. In this paper we present A-BIRCH, an approach for automatic threshold estimation for the BIRCH clustering algorithm. This approach computes the optimal threshold parameter of BIRCH from the data, such that BIRCH clusters properly even without the global clustering phase that is usually its final step. This is possible if the data satisfy certain constraints. If those constraints are not satisfied, A-BIRCH issues a pertinent warning before presenting the results. This approach renders the final global clustering step of BIRCH unnecessary in many situations, which has two advantages. First, we do not need to know the expected number of clusters beforehand. Second, without the computationally expensive final clustering, the fast BIRCH algorithm becomes even faster. For very large datasets, we introduce another variation of BIRCH, called MBD-BIRCH, which is of particular advantage in conjunction with A-BIRCH but is independent of it and also of general benefit.
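To make the role of the threshold concrete, the following is a minimal sketch of threshold-only clustering in the spirit of BIRCH without the global clustering phase. It is not the authors' A-BIRCH: the abstract does not say how the threshold is estimated, so here it is simply passed in by hand. The sketch also assumes a flat list of subclusters instead of BIRCH's CF-tree, and the names `ClusterFeature` and `threshold_clustering` are hypothetical. It only illustrates that, once a suitable radius threshold is fixed, the number of clusters falls out of the data rather than being supplied up front.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusterFeature:
    """Summary of a subcluster: point count, linear sum, sum of squared norms."""
    n: int = 0
    ls: list = field(default_factory=list)  # per-dimension linear sum
    ss: float = 0.0                         # sum of squared norms

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        """RMS distance to the centroid if `point` were absorbed."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.ls, point)]
        ss = self.ss + sum(x * x for x in point)
        variance = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(variance, 0.0))  # clamp rounding noise below zero

    def add(self, point):
        if not self.ls:
            self.ls = [0.0] * len(point)
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)


def threshold_clustering(points, threshold):
    """Simplified, flat (tree-less) illustration; `threshold` is hand-picked.

    Each point joins the nearest subcluster if that subcluster's radius
    stays within `threshold`; otherwise it starts a new subcluster.
    """
    features, labels = [], []
    for p in points:
        best = None
        if features:
            best = min(range(len(features)),
                       key=lambda i: math.dist(features[i].centroid(), p))
        if best is not None and features[best].radius_if_added(p) <= threshold:
            features[best].add(p)
        else:
            cf = ClusterFeature()
            cf.add(p)
            features.append(cf)
            best = len(features) - 1
        labels.append(best)
    return labels, features


points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2), (4.9, 5.2)]
labels, features = threshold_clustering(points, threshold=0.5)
print(labels, len(features))  # [0, 0, 1, 1, 0, 1] 2
```

With a threshold that is far too small (for example 0.01) every point in this toy input becomes its own subcluster, and with one that is far too large (for example 10) all points collapse into one, which is why estimating the threshold from the data matters.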


Updated: 2017-10-18
