A density-based approach for querying informative constraints for clustering

https://doi.org/10.1016/j.eswa.2020.113690

Highlights

  • A density-based approach is proposed for beneficial constraints selection.

  • Constraints are selected based on the impurities of data points.

  • Constraints are selected from the boundary and skeleton of clusters.

Abstract

In recent years, constrained clustering has emerged as an interesting direction in machine learning research. With constrained clustering, the quality of results can be improved, provided that a high-quality set of constraints is selected. Querying beneficial constraints is a challenging task because there is no metric for measuring the quality of constraints before clustering. This study proposes a new method that estimates the density and impurity of data points at different adjacency distances and calculates the centrality of each data point by applying a density tracking approach to the obtained densities. This information is then used to select a set of high-quality constraints. The most important contributions of this study are multi-resolution density analysis to more accurately estimate the point-point relationship of the data, density tracking to estimate the impurity and centrality of data points, and the selection of constraints from the skeleton of clusters to discover the intrinsic structure of the data. To verify the effectiveness of the proposed method, we conducted a series of experiments on real data sets. The results show that the proposed algorithm improves the clustering process compared with several recent reference algorithms.

Introduction

Clustering is an important task in the process of knowledge discovery in data mining. Clustering has wide applications in economic science, computer vision (Saha, Alok, & Ekbal, 2016), healthcare, information retrieval (Janani & Vijayarani, 2019) and the World Wide Web. In the past ten years, the problem of clustering with side information (known as constrained clustering) has become an active research direction that improves the quality of the results by integrating knowledge into the algorithms (Basu et al., 2008, Basu et al., 2004, Yan et al., 2006, Yin et al., 2010, Yeung and Chang, 2007, Grira et al., 2008, Ye et al., 2015, Schwenker and Trentin, 2014, Faur and Schwenker, 2014, Junhua et al., 2019, Śmieja et al., 2017). Constrained clustering uses a small set of side information to obtain the expected partitioning of data. Must-link (ML) and cannot-link (CL) constraints are the most widely used forms of side information in constrained clustering (Basu et al., 2008). A must-link constraint indicates that two points of the data set should be grouped in the same cluster, while a cannot-link constraint imposes that the points should be grouped in different clusters (see Fig. 1).
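The semantics of ML/CL constraints can be made concrete with a minimal sketch. The function name and data below are illustrative, not from the paper: a constraint set is represented as pairs of point indices, and a cluster assignment either satisfies all of them or not.

```python
# Minimal illustration of must-link (ML) and cannot-link (CL) constraints.
# Names and data are illustrative, not taken from the paper.

def satisfies_constraints(labels, must_link, cannot_link):
    """Check whether a cluster assignment respects ML/CL constraints."""
    for i, j in must_link:
        if labels[i] != labels[j]:   # ML: both points must share a cluster
            return False
    for i, j in cannot_link:
        if labels[i] == labels[j]:   # CL: the points must be separated
            return False
    return True

labels = [0, 0, 1, 1]   # a cluster assignment for four points
ml = [(0, 1)]           # points 0 and 1 must be in the same cluster
cl = [(1, 2)]           # points 1 and 2 must be in different clusters
print(satisfies_constraints(labels, ml, cl))  # True
```

Constrained clustering algorithms differ mainly in whether they enforce such constraints as hard requirements or use them as soft hints, e.g. to learn a distance metric.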

Besides research on constrained clustering algorithms themselves, the problem of choosing good constraints for constrained clustering is a crucial task. A data set of n points produces a large number of candidate constraints (i.e., n(n-1)/2 pairs), and even a few poorly chosen constraints can degrade the performance of the clustering process (Davidson, Wagstaff, & Basu, 2006). Moreover, it is very difficult to evaluate the quality of a set of constraints in advance. In fact, with the same constrained clustering algorithm, the quality of the clustering process heavily depends on the quality of the given constraints. Fig. 2 shows the relationship between constrained clustering and active learning in the context of eliciting the best side information for constrained clustering. In recent years, the topic of active learning for selecting constraints from users has attracted a lot of attention. Many algorithms have been proposed for choosing good constraints for constrained clustering; we can cite the work of Grira et al. for fuzzy C-Means clustering (Grira et al., 2008), the work of Basu et al. for K-Means clustering (Basu et al., 2004), the work of Vu et al. for all kinds of constrained clustering (Vu, Labroche, & Bouchon-Meunier, 2012), and the work of Abin et al. (Abin and Beigy, 2014, Abin and Beigy, 2015, Abin, 2019).

Most of the methods in the field of constraint selection have used the idea that constraints that resolve the ambiguity between clusters are useful, and have therefore tried to select constraints from the boundaries of clusters. They have ignored the fact that constraints that provide information about the skeleton of clusters can also be very useful: such constraints allow clustering algorithms to better discover complex-shaped clusters. In this work, we address the problem of selecting beneficial constraints by considering three key assumptions about the usefulness of constraints, based on the density relationship between data points. Based on these assumptions, we select constraints that provide helpful information about the boundary and skeleton of clusters. The proposed method first estimates the density and impurity of data points at different adjacency distances, and then applies a density tracking method to the obtained densities to calculate the centrality of each data point. Finally, the proposed method selects a set of high-quality constraints by using the density and impurity information in accordance with the above-mentioned assumptions.

Most of the existing studies in constraint selection are built upon assumptions about the usefulness of constraints without considering the density relationship between data points. In this study, we propose a new density-based approach for querying constraints. Like most existing methods, the proposed method tries to select constraints from the boundary of clusters. In addition, it uses the new idea of selecting constraints from the skeleton of clusters, so that helpful information about the cluster skeletons is provided during constraint selection. Selecting constraints from the skeleton of clusters, along with constraints that provide information about cluster boundaries, helps clustering algorithms accurately determine the boundaries of clusters and discover clusters with complex shapes.

Therefore, in selecting constraints, we have used three key assumptions about the usefulness of constraints. Based on these assumptions, a candidate constraint can be considered useful if at least one of the following conditions holds: (1) it provides helpful information about the boundary points of clusters, (2) it provides helpful information about the boundary between different clusters, or (3) it provides helpful information about the skeleton of clusters. Based on the first assumption, a selected constraint helps clustering algorithms precisely determine the boundary of each cluster. Constraints queried based on the second assumption resolve the ambiguity between adjacent clusters and help clustering algorithms precisely discover the boundary between clusters. Finally, constraints chosen by the third assumption give clustering algorithms useful information about the shape and distribution of clusters. Such constraints can be very useful for clustering algorithms that discover the shape of clusters or learn distance metrics from constraints. The most important innovations of this study are multi-resolution density analysis to more accurately estimate the point-point relationship of the data, density tracking to estimate the impurity and centrality of data points, and the selection of constraints from the skeleton of clusters to discover the internal structure of the data.
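The split between boundary and skeleton candidates can be sketched with a simple density score. This is a hedged illustration only: the thresholding scheme below is a stand-in, whereas the paper's actual criteria rely on impurity and centrality computed by density tracking.

```python
# Hedged sketch: separating "boundary" from "skeleton" candidate points using a
# plain k-NN density score. The quantile thresholds are illustrative assumptions,
# not the paper's selection rule.
import numpy as np

def knn_density(X, k=5):
    # density ~ inverse of the mean distance to the k nearest neighbours
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]   # drop the zero self-distance
    return 1.0 / (knn.mean(axis=1) + 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
rho = knn_density(X, k=5)
lo, hi = np.quantile(rho, [0.25, 0.75])
boundary = np.where(rho <= lo)[0]   # sparse points: likely near cluster boundaries
skeleton = np.where(rho >= hi)[0]   # dense points: likely on cluster skeletons
print(len(boundary), len(skeleton))
```

Pairs drawn between boundary points of different dense regions would target assumptions (1) and (2), while pairs along high-density chains would target assumption (3).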

Density tracking, a new definition of the point-point relationship of data, is explained in Section 3. The proposed method for querying constraints is given in Section 4. Experimental results are reported in Section 5. Section 6 analyses the idea behind the proposed method and its limitations. Finally, the conclusion and future directions are given in Section 7.

Section snippets

Related work

As mentioned above, the problem of choosing a good constraint set for constrained clustering is a crucial task. In this section, we review the principal works in the literature that show the effectiveness of collecting constraints from users.

The work of Klein et al. (Klein, Kamvar, & Manning, 2002) is probably the first to show the benefit of carefully chosen constraints for constrained clustering. In Basu et al. (2004), active learning for constraints collection based on the

Density tracking

In this section, we describe density tracking, which introduces a new definition of the point-point relationship. It is based on the assumption that a data point xi should be related to its closest neighbor xj with a higher density; the relationship between xi and xj is called the density relationship. This concept comes from the idea that it is enough to compare each data point with its neighbors (Sibson, 1973). Density tracking generally consists of three steps: density estimation,
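The linking step described above — relating each point to its closest neighbor of higher density — can be sketched as follows. This is a minimal, assumption-laden illustration (the density estimator and function names are mine, not the paper's); the resulting parent links form a tree rooted at the globally densest point, in the style of density-peaks clustering.

```python
# Hedged sketch of the density-tracking link: each point is attached to its
# nearest neighbour of strictly higher density. The k-NN density estimate is
# an illustrative stand-in for the paper's multi-resolution estimation.
import numpy as np

def density_track(X, k=5):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)   # simple k-NN density estimate
    parent = np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points denser than x_i
        if len(higher):
            parent[i] = higher[np.argmin(dist[i, higher])]
    return rho, parent                       # parent == -1 marks the root

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
rho, parent = density_track(X)
print((parent == -1).sum())  # 1: only the globally densest point has no parent
```

Following the parent chain from any point climbs toward the local density peak, which is what makes this structure useful for estimating centrality and impurity.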

The proposed method

Let X = {x_1, x_2, …, x_n} ⊂ R^d be a set of data points. In a constraint selection problem, we want to choose λ constraints for clustering that contain as much information as possible. To this end, we use a density-based approach for querying constraints based on the following key assumptions about the quality of constraints. A candidate constraint is useful if at least one of the following conditions is met: (1) it provides helpful information about the boundary points of clusters or, (2) it provides helpful

Experiments

To evaluate the proposed method, we have experimented on several datasets and compared the quality of the selected constraints with that of other methods. The default values for K = {k1, k2, …, km}, density_drop_rate, and sampling_rate are set to {5, 7, 9, 11, 13, 15}, 0.8, and 3, respectively. We perform a fixed-step grid search in parameter space to determine the best-fit value of γ in Algorithm 3; the step size of the search is 0.05. Descriptions of the compared methods, clustering techniques, test
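The fixed-step grid search over γ mentioned above can be sketched as below. The scoring function is a hypothetical stand-in; in the paper, each candidate γ is evaluated inside Algorithm 3.

```python
# Hedged sketch of a fixed-step grid search with step 0.05, as used for gamma.
# The search range [0, 1] and the toy score are illustrative assumptions.
import numpy as np

def grid_search_gamma(score, lo=0.0, hi=1.0, step=0.05):
    grid = np.arange(lo, hi + step / 2, step)   # include the upper endpoint
    scores = [score(g) for g in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# toy score with its maximum at gamma = 0.35 (illustrative only)
best_g, best_s = grid_search_gamma(lambda g: -(g - 0.35) ** 2)
print(round(best_g, 2))  # 0.35
```

A finer step trades runtime for precision; with step 0.05 the search evaluates only 21 candidate values of γ per dataset.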

Analysis of the proposed method

In this section, we investigate the idea behind the proposed method, its limitations, and its time complexity. As mentioned in Section 4, the proposed method uses the idea of density tracking to query beneficial constraints based on three key assumptions. Under these assumptions, the proposed method tries to select constraints that provide helpful information about the boundary and skeleton of clusters. Fig. 13 shows the output of the proposed method on several synthetic data sets. In this figure,

Conclusions

In this work, the idea of density tracking is used to select beneficial constraints before clustering. We considered the density relationship between data points and proposed a constraint selection method built on three key assumptions about the quality of constraints. In these assumptions, constraints that provide helpful information about the boundary and skeleton of clusters are considered high-quality. Based on these assumptions, a three-step method was proposed

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.18.40.

Acknowledgment

This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.18.40.

References (29)

  • Y. Ye et al.

    Incorporating side information into multivariate information bottleneck for generating alternative clusterings

    Pattern Recognition Letters

    (2015)
  • X. Yin et al.

    Semi-supervised clustering with metric learning: An adaptive kernel method

    Pattern Recognition

    (2010)
  • A.A. Abin

    A random walk approach to query informative constraints for clustering

    (2017)
  • A. Bar-Hillel et al.

Learning a Mahalanobis metric from equivalence constraints

    Journal of Machine Learning Research

    (2005)