A density-based approach for querying informative constraints for clustering

https://doi.org/10.1016/j.eswa.2020.113690

Highlights

  • A density-based approach is proposed for beneficial constraints selection.

  • Constraints are selected based on the impurities of data points.

  • Constraints are selected from the boundary and skeleton of clusters.

Abstract

In recent years, constrained clustering has emerged as an interesting direction in machine learning research. With constrained clustering, the quality of results can be improved, provided that a high-quality set of constraints is selected. Querying beneficial constraints is a challenging task because there is no metric for measuring the quality of constraints before clustering. This study proposes a new method that estimates the density and impurity of data points at different adjacency distances and calculates the centrality of each data point by applying a density tracking approach to the obtained densities. This information is then used to select a set of high-quality constraints. The most important contributions of this study are multi-resolution density analysis to more accurately estimate the point-point relationship of the data, density tracking to estimate the impurity and centrality of data points, and the selection of constraints from the skeleton of clusters to discover the intrinsic structure of the data. To verify the effectiveness of the proposed method, we conducted a series of experiments on real data sets. The results show that the proposed algorithm improves the clustering process compared with several recent reference algorithms.

Introduction

Clustering is an important task in the process of knowledge discovery in data mining. Clustering has wide applications in economic science, computer vision (Saha, Alok, & Ekbal, 2016), healthcare, information retrieval (Janani & Vijayarani, 2019) and the World Wide Web. In the past ten years, the problem of clustering with side information (known as constrained clustering) has become an active research direction that improves the quality of the results by integrating knowledge into the algorithms (Basu et al., 2008, Basu et al., 2004, Yan et al., 2006, Yin et al., 2010, Yeung and Chang, 2007, Grira et al., 2008, Ye et al., 2015, Schwenker and Trentin, 2014, Faur and Schwenker, 2014, Junhua et al., 2019, Śmieja et al., 2017). Constrained clustering uses a small set of side information to obtain the expected partitioning of data. Must-link (ML) and cannot-link (CL) constraints are the most widely used forms of side information in constrained clustering (Basu et al., 2008). A must-link constraint indicates that two points of the data set should be grouped in the same cluster, while a cannot-link constraint imposes that the points should be grouped in different clusters (see Fig. 1).
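The semantics of ML/CL constraints can be made concrete with a minimal sketch. The function name and data below are illustrative, not from the paper: a constraint set is represented as pairs of point indices, and a cluster assignment either satisfies all of them or not.

```python
# Minimal illustration of must-link (ML) and cannot-link (CL) constraints.
# Names and data are illustrative, not taken from the paper.

def satisfies_constraints(labels, must_link, cannot_link):
    """Check whether a cluster assignment respects ML/CL constraints."""
    for i, j in must_link:
        if labels[i] != labels[j]:   # ML: both points must share a cluster
            return False
    for i, j in cannot_link:
        if labels[i] == labels[j]:   # CL: the points must be separated
            return False
    return True

labels = [0, 0, 1, 1]   # a cluster assignment for four points
ml = [(0, 1)]           # points 0 and 1 must be in the same cluster
cl = [(1, 2)]           # points 1 and 2 must be in different clusters
print(satisfies_constraints(labels, ml, cl))  # True
```

Constrained clustering algorithms differ mainly in whether they enforce such constraints as hard requirements or use them as soft hints, e.g. to learn a distance metric.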

Besides research on constrained clustering algorithms themselves, the problem of choosing good constraints for constrained clustering is a crucial task. A data set of n points produces a large number of candidate constraints (i.e., n(n-1)/2 pairs), and even a few poorly chosen constraints can degrade the performance of the clustering process (Davidson, Wagstaff, & Basu, 2006). Moreover, it is very difficult to evaluate the quality of a set of constraints in advance. In fact, with the same constrained clustering algorithm, the quality of the clustering process heavily depends on the quality of the given constraints. Fig. 2 shows the relationship between constrained clustering and active learning in the context of eliciting the best side information for constrained clustering. In recent years, the topic of active learning for selecting constraints from users has attracted a lot of attention. Many algorithms have been proposed for choosing good constraints for constrained clustering; we can cite the work of Grira et al. for fuzzy C-Means clustering (Grira et al., 2008), the work of Basu et al. for K-Means clustering (Basu et al., 2004), the work of Vu et al. for all kinds of constrained clustering (Vu, Labroche, & Bouchon-Meunier, 2012), and the work of Abin et al. (Abin and Beigy, 2014, Abin and Beigy, 2015, Abin, 2019).

Most of the methods in the field of constraint selection have used the idea that constraints that resolve the ambiguity between clusters are useful, and have therefore tried to select constraints from the boundaries of clusters. They have ignored the fact that constraints that provide information about the skeleton of clusters can also be very useful: such constraints allow clustering algorithms to better discover complex-shaped clusters. In this work, we address the problem of selecting beneficial constraints by considering three key assumptions about the usefulness of constraints, based on the density relationship between data points. Based on these assumptions, we select constraints that provide helpful information about the boundary and skeleton of clusters. The proposed method first estimates the density and impurity of data points at different adjacency distances, and then applies a density tracking method to the obtained densities to calculate the centrality of each data point. Finally, the proposed method selects a set of high-quality constraints by using the density and impurity information in accordance with the above-mentioned assumptions.

Most of the existing studies in constraint selection are built upon assumptions about the usefulness of constraints without considering the density relationship between data points. In this study, we propose a new density-based approach for querying constraints. Like most existing methods, the proposed method tries to select constraints from the boundary of clusters. In addition, it uses the new idea of selecting constraints from the skeleton of clusters, so that helpful information about the cluster skeletons is provided during constraint selection. Selecting constraints from the skeleton of clusters, along with constraints that provide information about cluster boundaries, helps clustering algorithms accurately determine the boundaries of clusters and discover clusters with complex shapes.

Therefore, in selecting constraints, we have used three key assumptions about the usefulness of constraints. Based on these assumptions, a candidate constraint can be considered useful if at least one of the following conditions holds: (1) it provides helpful information about the boundary points of clusters, (2) it provides helpful information about the boundary between different clusters, or (3) it provides helpful information about the skeleton of clusters. Based on the first assumption, a selected constraint helps clustering algorithms precisely determine the boundary of each cluster. Constraints queried based on the second assumption resolve the ambiguity between adjacent clusters and help clustering algorithms precisely discover the boundary between clusters. Finally, constraints chosen by the third assumption give clustering algorithms useful information about the shape and distribution of clusters. Such constraints can be very useful for clustering algorithms that discover the shape of clusters or learn distance metrics from constraints. The most important innovations of this study are multi-resolution density analysis to more accurately estimate the point-point relationship of the data, density tracking to estimate the impurity and centrality of data points, and the selection of constraints from the skeleton of clusters to discover the internal structure of the data.
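The split between boundary and skeleton candidates can be sketched with a simple density score. This is a hedged illustration only: the thresholding scheme below is a stand-in, whereas the paper's actual criteria rely on impurity and centrality computed by density tracking.

```python
# Hedged sketch: separating "boundary" from "skeleton" candidate points using a
# plain k-NN density score. The quantile thresholds are illustrative assumptions,
# not the paper's selection rule.
import numpy as np

def knn_density(X, k=5):
    # density ~ inverse of the mean distance to the k nearest neighbours
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]   # drop the zero self-distance
    return 1.0 / (knn.mean(axis=1) + 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
rho = knn_density(X, k=5)
lo, hi = np.quantile(rho, [0.25, 0.75])
boundary = np.where(rho <= lo)[0]   # sparse points: likely near cluster boundaries
skeleton = np.where(rho >= hi)[0]   # dense points: likely on cluster skeletons
print(len(boundary), len(skeleton))
```

Pairs drawn between boundary points of different dense regions would target assumptions (1) and (2), while pairs along high-density chains would target assumption (3).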

Density tracking, a new definition of the point-point relationship of data, is explained in Section 3. The proposed method for querying constraints is given in Section 4. Experimental results are reported in Section 5. Section 6 analyses the idea behind the proposed method and its limitations. Finally, the conclusion and future directions are given in Section 7.

Section snippets

Related work

As mentioned above, the problem of choosing a good constraint set for constrained clustering is a crucial task. In this section, we review the principal works in the literature that show the effectiveness of collecting constraints from users.

The work of Klein et al. (Klein, Kamvar, & Manning, 2002) is probably the first to show the benefit of carefully chosen constraints for constrained clustering. In Basu et al. (2004), active learning for constraints collection based on the

Density tracking

In this section, we describe density tracking, which introduces a new definition of the point-point relationship. It is based on the assumption that a data point xi should be related to its closest neighbor xj with a higher density; the relationship between xi and xj is called the density relationship. This concept comes from the idea that it is enough to compare each data point with its neighbors (Sibson, 1973). Density tracking generally consists of three steps: density estimation,
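The linking step described above — relating each point to its closest neighbor of higher density — can be sketched as follows. This is a minimal, assumption-laden illustration (the density estimator and function names are mine, not the paper's); the resulting parent links form a tree rooted at the globally densest point, in the style of density-peaks clustering.

```python
# Hedged sketch of the density-tracking link: each point is attached to its
# nearest neighbour of strictly higher density. The k-NN density estimate is
# an illustrative stand-in for the paper's multi-resolution estimation.
import numpy as np

def density_track(X, k=5):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)   # simple k-NN density estimate
    parent = np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points denser than x_i
        if len(higher):
            parent[i] = higher[np.argmin(dist[i, higher])]
    return rho, parent                       # parent == -1 marks the root

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
rho, parent = density_track(X)
print((parent == -1).sum())  # 1: only the globally densest point has no parent
```

Following the parent chain from any point climbs toward the local density peak, which is what makes this structure useful for estimating centrality and impurity.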

The proposed method

Let X = {x_1, x_2, …, x_n} ⊂ R^d be a set of data points. In a constraint selection problem, we want to choose λ constraints for clustering that contain as much information as possible. To this end, we use a density-based approach for querying constraints based on the following key assumptions about the quality of constraints. A candidate constraint is useful if at least one of the following conditions is met: (1) it provides helpful information about the boundary points of clusters or, (2) it provides helpful

Experiments

To evaluate the proposed method, we have experimented on several datasets and compared the quality of the selected constraints with that of other methods. The default values for K = {k1, k2, …, km}, density_drop_rate, and sampling_rate are set to {5, 7, 9, 11, 13, 15}, 0.8, and 3, respectively. We perform a fixed-step grid search in parameter space to determine the best-fit value of γ in Algorithm 3; the step size of the search is 0.05. Descriptions of the compared methods, clustering techniques, test
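The fixed-step grid search over γ mentioned above can be sketched as below. The scoring function is a hypothetical stand-in; in the paper, each candidate γ is evaluated inside Algorithm 3.

```python
# Hedged sketch of a fixed-step grid search with step 0.05, as used for gamma.
# The search range [0, 1] and the toy score are illustrative assumptions.
import numpy as np

def grid_search_gamma(score, lo=0.0, hi=1.0, step=0.05):
    grid = np.arange(lo, hi + step / 2, step)   # include the upper endpoint
    scores = [score(g) for g in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# toy score with its maximum at gamma = 0.35 (illustrative only)
best_g, best_s = grid_search_gamma(lambda g: -(g - 0.35) ** 2)
print(round(best_g, 2))  # 0.35
```

A finer step trades runtime for precision; with step 0.05 the search evaluates only 21 candidate values of γ per dataset.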

Analysis of the proposed method

In this section, we investigate the idea behind the proposed method, its limitations, and its time complexity. As mentioned in Section 4, the proposed method uses the idea of density tracking to query beneficial constraints based on three key assumptions. Under these assumptions, the proposed method tries to select constraints that provide helpful information about the boundary and skeleton of clusters. Fig. 13 shows the output of the proposed method on several synthetic data sets. In this figure,

Conclusions

In this work, the idea of density tracking is used to select beneficial constraints before clustering. We considered the density relationship between data points and proposed a constraint selection method built on three key assumptions about the quality of constraints. In these assumptions, constraints that provide helpful information about the boundary and skeleton of clusters are considered high-quality. Based on these assumptions, a three-step method was proposed

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.18.40.

Acknowledgment

This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.18.40.

References (29)

  • Y. Ye et al.

    Incorporating side information into multivariate information bottleneck for generating alternative clusterings

    Pattern Recognition Letters

    (2015)
  • X. Yin et al.

    Semi-supervised clustering with metric learning: An adaptive kernel method

    Pattern Recognition

    (2010)
  • A.A. Abin

    A random walk approach to query informative constraints for clustering

    (2017)
  • A. Bar-Hillel et al.

Learning a Mahalanobis metric from equivalence constraints

    Journal of Machine Learning Research

    (2005)