A unified view of density-based methods for semi-supervised clustering and classification
Data Mining and Knowledge Discovery ( IF 4.8 ) Pub Date : 2020-07-27 , DOI: 10.1007/s10618-019-00651-1
Jadson Castro Gertrudes 1 , Arthur Zimek 2 , Jörg Sander 3 , Ricardo J G B Campello 4

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data, which are laborious and expensive to obtain, is increasing dramatically. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building blocks from density-based clustering. This framework is not only efficient and effective, but also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach for both semi-supervised classification and semi-supervised clustering.

Updated: 2020-07-27