Supervised kernel density estimation K-means

https://doi.org/10.1016/j.eswa.2020.114350

Highlights

  • A new method of supervised learning based on k-means.

  • Managing imbalanced classes by creating and weighting new clusters.

  • Managing degenerate clusters by marking and removing useless clusters.

  • A new approach to incremental semi-supervised learning based on the proposed method.

  • High-confidence unlabeled data selected by kernel density estimators and similarity measures.

Abstract

K-means is a well-known unsupervised-learning algorithm. It assigns data points to k clusters, the centers of which are termed centroids.

However, these centroids form a structure that is usually represented by a list of quantized vectors, whereas kernel density estimation models can better represent complex data distributions. This paper proposes a k-means-based supervised-learning clustering method termed supervised kernel-density-estimation k-means. The proposed approach applies kernel density estimation to the class examples inside each cluster to obtain a better representation of the data distribution. The algorithm constructs an initial model using supervised k-means with an equal seed distribution among the classes so that a balance between majority and minority classes is achieved. We also incorporate incremental semi-supervised learning into the proposed method. Experiments were conducted on publicly available benchmark datasets. The results demonstrated that, compared with state-of-the-art supervised methods, the proposed algorithm, which can also perform incremental semi-supervised learning, achieved highly satisfactory performance.

Introduction

K-means is a classic unsupervised learning method that constructs a cluster structure comprising a list of quantized vectors, or centroid coordinates, to represent groups of examples. Although this approach is appealing, it has a few drawbacks. Some of these drawbacks have been addressed, but others require further exploration. For instance, a major issue is the use of hyperspherical centroids to represent the clusters (Guha et al., 1998, Jain, 2010, Raykov et al., 2016), because k-means minimizes the within-cluster variance and thus generates convex, Voronoi-cell-shaped clusters (Adams, 2018); another issue is that the centroids are obtained using all the training data, which prevents new examples from being learned incrementally. With proper modifications, k-means can be revisited and converted into a supervised classifier; furthermore, the k-means structure can be easily updated with new data so that it dynamically adapts to data-distribution changes in real-world scenarios.
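To make the Voronoi-cell limitation concrete, the following is a minimal sketch of plain k-means (illustrative only, not the authors' code); the nearest-centroid assignment step is what induces the convex cluster boundaries mentioned above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns centroids and hard assignments."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid under Euclidean distance,
        # which carves the space into convex Voronoi cells.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```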

Several supervised k-means implementations have been proposed, in which classes are assigned to clusters of labeled data. A common technique, known as constrained clustering (Wagstaff et al., 2001), maintains a must-link list of examples to maximize the majority class within a cluster and a cannot-link list to prevent certain examples from falling into the same cluster. In this paper, we propose using kernel density estimation (KDE) to estimate the underlying probability model of a dataset, because data do not always follow a unimodal probability distribution, as is implicitly assumed when a single class is assigned to a k-means cluster centroid. Instead of assigning a class to a k-means cluster, KDE can generate a better cluster model, refining the partition of examples. The proposed method attaches a KDE model to each cluster to improve the modeling of the underlying probability density of the cluster data, and consequently the classification performance. The proposed method can also handle generalization issues, since degenerate clusters with few examples may arise as artifacts.
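As a rough illustration of this idea (our own sketch under assumed details such as a fixed Gaussian kernel and bandwidth, not the published algorithm), one can fit a KDE per class inside each k-means cluster and classify a point by the class whose KDE assigns it the highest density within its nearest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def fit_cluster_kdes(X, y, k, bandwidth=0.5):
    """Fit k-means, then one Gaussian KDE per (cluster, class) pair."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    kdes = {}  # maps (cluster_id, class_label) -> fitted KDE
    for c in range(k):
        in_cluster = km.labels_ == c
        for label in np.unique(y[in_cluster]):
            pts = X[in_cluster & (y == label)]
            kdes[(c, label)] = KernelDensity(kernel="gaussian",
                                             bandwidth=bandwidth).fit(pts)
    return km, kdes

def predict_one(x, km, kdes):
    """Classify x by the class KDE with the highest log-density
    inside x's nearest cluster."""
    c = int(km.predict(x.reshape(1, -1))[0])
    scores = {lbl: kde.score_samples(x.reshape(1, -1))[0]
              for (cc, lbl), kde in kdes.items() if cc == c}
    return max(scores, key=scores.get)
```

In this toy version each cluster keeps one KDE per class it contains, which is the sense in which the cluster model is refined beyond a single centroid-to-class assignment.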

We also develop an incremental semi-supervised version of the proposed algorithm. Incremental learning is useful in cases where a classifier does not have all the training data available in a single dataset. It allows new data to be incorporated into a model without requiring access to all the original data, and an incremental technique need not be retrained from scratch to insert new data. In these techniques, learning occurs continuously over time and does not cease once all available data have been exhausted (Giraud-Carrier, 2000).

There are several applications in which few labeled examples are initially available, making incremental learning techniques desirable: (a) Robotics (Gepperth & Hammer, 2016), specifically autonomous control, service robotics, self-localization, and autonomous vehicles. For example, in a typical simultaneous localization and mapping problem, a robot should incrementally learn its location and generate a map (Ji et al., 2008). (b) Concept drift. Several problems involve nonstationary data environments (the statistical characteristics of the data change over time); in such problems, continuous learning is necessary (Elwell & Polikar, 2011). For example, spam-filtering methods should adapt to a dynamic environment (Sheu et al., 2017) because the spam style and/or the preferences of a user can change over time. (c) Smart homes. Learning user behavior is important to achieve an adaptable environment and to identify problematic situations (De Carolis et al., 2015). Accordingly, incremental learning is a promising approach. (d) Outlier detection (Gepperth & Hammer, 2016), specifically process monitoring, fault diagnosis, and cybersecurity. For example, incremental learning can be useful for detecting suspicious activity using outlier-detection techniques (Pokrajac et al., 2009).

Semi-supervised learning is another important technique because traditional classifiers use only labeled data for training, and the amount of such data may be quite limited. Moreover, labeled data may not be easy to obtain because labeling is an expensive and time-consuming task that should be performed by experts. However, vast amounts of unlabeled data are relatively easy to collect and may be useful for classification purposes, yet they are normally neglected. Semi-supervised learning techniques address this by using a large amount of unlabeled data, along with the labeled data, to construct a classification or prediction model (Zhu, 2005).

A useful approach is to combine incremental with semi-supervised learning. Incremental supervised learning techniques learn as new data are collected, but they require human intervention for data labeling; semi-supervised learning methods can label examples and use them for training, but they normally require storing the entire training dataset to obtain a new model. Combining both techniques yields a fast, self-adaptive learning system that does not require human intervention to label the examples.

The aim of this study is thus to design a supervised-learning k-means method that can generate an initial model from a small number of labeled examples by using kernel density estimation, and that can efficiently adapt its structure to learn new (labeled and unlabeled) examples. For this purpose, we propose several modifications to k-means that involve changes to the clustering procedure as well as structural changes.

The proposed method is termed supervised KDE k-means (SKDEKMeans) and was tested on public datasets. The results were compared with those obtained by an online random forest model termed Mondrian forest (MF) (Lakshminarayanan et al., 2014), a semi-supervised Learn++ model (SSLearn++) (Polikar et al., 2001), an incremental extreme learning machine model (IELM) (Huang & Chen, 2007), a stochastic gradient descent model (SGD), a Gaussian naive Bayes model (GNB) with incremental updates (Chan et al., 1979), and an online (semi-)supervised growing-when-required model (O(S)SGWR) (Parisi et al., 2017). It was demonstrated that the proposed method performed better than the other approaches; furthermore, it is less complex and faster than the node-based algorithm OSSGWR. Therefore, the proposed technique based on k-means and incremental semi-supervised learning is quite promising.

The contributions of this study are the following:

  • A new supervised-learning method based on k-means is proposed.

  • Imbalanced classes are handled by generating new clusters and weighting their influence.

  • Degenerate clusters are handled by marking and removing useless clusters.

  • Incremental semi-supervised learning is incorporated into the proposed method. It includes a set of conditions to select only high-confidence unlabeled examples based on kernel density estimators and other similarity measures (a sketch of this selection idea is given after this list).
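The sketch below illustrates one way such a high-confidence selection rule could look. It is hypothetical: the log-density margin test, the `margin` value, and the reuse of the `km` and `kdes` objects from the earlier sketch are our assumptions, not the paper's published conditions.

```python
import numpy as np

def select_high_confidence(X_unlabeled, km, kdes, margin=2.0):
    """Keep an unlabeled point only if the best class KDE beats the
    runner-up by `margin` in log-density within its nearest cluster."""
    selected, pseudo_labels = [], []
    for x in X_unlabeled:
        c = int(km.predict(x.reshape(1, -1))[0])
        scores = {lbl: kde.score_samples(x.reshape(1, -1))[0]
                  for (cc, lbl), kde in kdes.items() if cc == c}
        ranked = sorted(scores.values(), reverse=True)
        # A single-class cluster is treated as trivially confident.
        if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
            selected.append(x)
            pseudo_labels.append(max(scores, key=scores.get))
    return np.array(selected), np.array(pseudo_labels)
```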

This paper is organized as follows. In Section 2, previous work related to supervised and semi-supervised clustering and incremental learning, specifically using k-means, is reviewed. In Section 3, the changes to k-means for supervised, semi-supervised, and incremental learning are explained, and the proposed method is described. The experiments are described in Section 4, and the results are presented in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Related work

Herein, we briefly review the k-means algorithm as well as concepts and approaches related to supervised learning, KDE, and incremental semi-supervised learning using k-means.

Materials and methods

The proposed method is based on the k-means cluster structure and KDE. It primarily changes the use of the k parameter: it dynamically increases and decreases the k value to fit the data in a supervised-learning context, attempting to match the size and distribution of each class and to eliminate outliers. The method differs from previous semi-supervised approaches in that it does not use pairwise constraints but similarity measures. However, it uses the concept of cluster induction
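As a hedged toy illustration of the dynamic-k behavior described in this snippet (our own simplification under assumed rules, not the published procedure): impure clusters are split per class and near-empty clusters are dropped, so k grows and shrinks with the labeled data.

```python
import numpy as np

def adjust_clusters(X, y, centroids, min_size=3):
    """Split impure clusters per class and drop near-empty ones,
    so the number of clusters k tracks the labeled data."""
    labels = np.linalg.norm(
        X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    new_centroids = []
    for c in range(len(centroids)):
        pts, cls = X[labels == c], y[labels == c]
        if len(pts) < min_size:
            continue                      # drop a degenerate cluster
        if len(np.unique(cls)) > 1:       # impure: one new centroid per class
            for lbl in np.unique(cls):
                new_centroids.append(pts[cls == lbl].mean(axis=0))
        else:
            new_centroids.append(centroids[c])
    return np.array(new_centroids)        # k has grown or shrunk
```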

Experimental setup

Herein, we present the benchmark datasets, baseline classifiers, parameter settings, and data division for the experiments.

Results

The accuracy of the methods trained using the UCI, SSL-Book, and KEEL benchmarks is shown in Table 3, which presents the results obtained using labeled examples, and Table 4, which refers to incremental training of the remaining data (unlabeled examples). The tables also contain results obtained by the proposed methods, a label propagation (LP) model, MF, a Learn++ model (LPP), IELM, SGD, GNB, and O(S)SGWR. Additional results by a k-nearest-neighbors model (kNN, with k=1) exclusively for

Conclusions

In this study, we developed a supervised-learning method using k-means aided by kernel density estimation to support an incremental semi-supervised learning method. Specifically, we proposed a supervised method and a prototype of an incremental method. For the latter, we established a tentative set of conditions for selecting high-confidence unlabeled examples for training. These conditions were coupled with KDE models obtained by the supervised method. Moreover, the proposed method has a

CRediT authorship contribution statement

Frederico Damasceno Bortoloti: Methodology, Data curation, Investigation, Writing - original draft, Software, Formal analysis, Visualization, Validation. Elias de Oliveira: Conceptualization, Methodology, Writing - review & editing. Patrick Marques Ciarelli: Conceptualization, Methodology, Writing - review & editing, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

Patrick Marques Ciarelli thanks CNPq, Brazil, for the partial funding of his research work (grant 312032/2015-3).

References (58)

  • Bar-Hillel, A., et al. (2005). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research.

  • Basu, S., et al. Semi-supervised clustering by seeding.

  • Basu, S., et al. Active semi-supervision for pairwise constrained clustering.

  • Bradley, P. S., et al. Constrained k-means clustering.

  • Chambers, J. M., et al. (1983). Graphical methods for data analysis.

  • Chan, T. F., et al. (1979). Updating formulae and a pairwise algorithm for computing sample variances.

  • Chapelle, O., et al. (2010). Semi-supervised learning.

  • Chemchem, A., et al. Incremental induction rules clustering.

  • De Carolis, B., et al. (2015). Incremental learning of daily routines as workflows in a smart home environment. ACM Transactions on Interactive Intelligent Systems.

  • Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.

  • Elwell, R., et al. (2011). Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks.

  • Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications.

  • Forgy, E. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics.

  • Fritzke, B. A growing neural gas network learns topologies.

  • Gepperth, A., et al. Incremental learning algorithms and applications.

  • Giraud-Carrier, C. (2000). A note on the utility of incremental learning. AI Communications.

  • Grira, N., et al. (2004). Unsupervised and semi-supervised clustering: A brief survey. In A review of machine learning techniques for processing multimedia content, Report of the MUSCLE European Network of Excellence (FP6).

  • Guha, S., et al. CURE: An efficient clustering algorithm for large databases.