Supervised kernel density estimation K-means

https://doi.org/10.1016/j.eswa.2020.114350

Highlights

  • A new method of supervised learning based on k-means.

  • Managing imbalanced classes by creating and weighting new clusters.

  • Managing degenerate clusters by marking and removing useless clusters.

  • A new approach to incremental semi-supervised learning based on the proposed method.

  • High-confidence unlabeled data selected by kernel density estimators and similarity measures.

Abstract

K-means is a well-known unsupervised-learning algorithm. It assigns data points to k clusters, the centers of which are termed centroids.

However, these centroids form a structure that is usually represented by a list of quantized vectors, whereas kernel density estimation models can better represent complex data distributions. This paper proposes a k-means-based supervised-learning clustering method termed supervised kernel-density-estimation k-means. The proposed approach applies kernel density estimation to the class examples inside each cluster to obtain a better representation of the data distribution. The algorithm constructs an initial model using supervised k-means with an equal seed distribution among the classes so that a balance between majority and minority classes is achieved. We also incorporate incremental semi-supervised learning into the proposed method. Experiments were conducted on publicly available benchmark datasets. The results demonstrated that, compared with state-of-the-art supervised methods, the proposed algorithm, which can also perform incremental semi-supervised learning, achieved highly satisfactory performance.

Introduction

K-means is a classic unsupervised learning method that constructs a cluster structure comprising a list of quantized vectors, or centroid coordinates, to represent groups of examples. Although this approach is appealing, it has a few drawbacks. Some of these drawbacks have been addressed, but others require further exploration. For instance, a major issue is the use of hyperspherical centroids to represent the clusters (Guha et al., 1998, Jain, 2010, Raykov et al., 2016), because k-means minimizes the within-cluster variance and thus generates convex, Voronoi-cell-shaped clusters (Adams, 2018); another issue is that the centroids are obtained using all the training data, which prevents new examples from being learned incrementally. With proper modifications, k-means can be revisited and converted into a supervised classifier; furthermore, the k-means structure can be easily updated with new data so that it dynamically adapts to data-distribution changes in real-world scenarios.
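To make the Voronoi-cell limitation concrete, the following is a minimal sketch of plain k-means (illustrative only, not the authors' code); the nearest-centroid assignment step is what induces the convex cluster boundaries mentioned above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns centroids and hard assignments."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid under Euclidean distance,
        # which carves the space into convex Voronoi cells.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```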

Several supervised k-means implementations have been proposed, in which classes are assigned to clusters of labeled data. A common technique, known as constrained clustering (Wagstaff et al., 2001), maintains a must-link list of examples to maximize the majority class within a cluster and a cannot-link list to prevent certain examples from falling into the same cluster. In this paper, we propose using kernel density estimation (KDE) to estimate the underlying probability model of a dataset, because data do not always follow a unimodal probability distribution, as is implicitly assumed when a single class is assigned to a k-means cluster centroid. Instead of assigning a class to a k-means cluster, KDE can generate a better cluster model, refining the partition of examples. The proposed method attaches a KDE model to each cluster to improve the modeling of the underlying probability density of the cluster data, and consequently the classification performance. The proposed method can also handle generalization issues, since degenerate clusters with few examples may arise as artifacts.
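As a rough illustration of this idea (our own sketch under assumed details such as a fixed Gaussian kernel and bandwidth, not the published algorithm), one can fit a KDE per class inside each k-means cluster and classify a point by the class whose KDE assigns it the highest density within its nearest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def fit_cluster_kdes(X, y, k, bandwidth=0.5):
    """Fit k-means, then one Gaussian KDE per (cluster, class) pair."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    kdes = {}  # maps (cluster_id, class_label) -> fitted KDE
    for c in range(k):
        in_cluster = km.labels_ == c
        for label in np.unique(y[in_cluster]):
            pts = X[in_cluster & (y == label)]
            kdes[(c, label)] = KernelDensity(kernel="gaussian",
                                             bandwidth=bandwidth).fit(pts)
    return km, kdes

def predict_one(x, km, kdes):
    """Classify x by the class KDE with the highest log-density
    inside x's nearest cluster."""
    c = int(km.predict(x.reshape(1, -1))[0])
    scores = {lbl: kde.score_samples(x.reshape(1, -1))[0]
              for (cc, lbl), kde in kdes.items() if cc == c}
    return max(scores, key=scores.get)
```

In this toy version each cluster keeps one KDE per class it contains, which is the sense in which the cluster model is refined beyond a single centroid-to-class assignment.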

We also develop an incremental semi-supervised version of the proposed algorithm. Incremental learning is useful in cases where a classifier does not have all the training data available in a single dataset. It allows new data to be incorporated into a model without requiring access to all the original data, and an incremental technique need not be retrained from scratch to insert new data. In these techniques, learning occurs continuously over time and does not cease once all available data have been exhausted (Giraud-Carrier, 2000).

There are several applications in which few labeled examples are initially available, making incremental learning techniques desirable: (a) Robotics (Gepperth & Hammer, 2016), specifically autonomous control, service robotics, self-localization, and autonomous vehicles. For example, in a typical simultaneous localization and mapping problem, a robot should incrementally learn its location and generate a map (Ji et al., 2008). (b) Concept drift. Several problems involve nonstationary data environments (the statistical characteristics of the data change over time); in such problems, continuous learning is necessary (Elwell & Polikar, 2011). For example, spam-filtering methods should adapt to a dynamic environment (Sheu et al., 2017) because the spam style and/or the preferences of a user can change over time. (c) Smart homes. Learning user behavior is important to achieve an adaptable environment and to identify problematic situations (De Carolis et al., 2015). Accordingly, incremental learning is a promising approach. (d) Outlier detection (Gepperth & Hammer, 2016), specifically process monitoring, fault diagnosis, and cybersecurity. For example, incremental learning can be useful for detecting suspicious activity using outlier-detection techniques (Pokrajac et al., 2009).

Semi-supervised learning is another important technique because traditional classifiers use only labeled data for training, and the amount of such data may be quite limited. Moreover, labeled data may not be easy to obtain because labeling is an expensive and time-consuming task that should be performed by experts. However, vast amounts of unlabeled data are relatively easy to collect and may be useful for classification purposes, yet they are normally neglected. Semi-supervised learning techniques address this by using a large amount of unlabeled data, along with the labeled data, to construct a classification or prediction model (Zhu, 2005).

A useful approach is to combine incremental with semi-supervised learning. Incremental supervised learning techniques learn as new data are collected, but they require human intervention for data labeling; semi-supervised learning methods can label examples and use them for training, but they normally require storing the entire training dataset to obtain a new model. Combining both techniques yields a fast, self-adaptive learning system that does not require human intervention to label the examples.

The aim of this study is thus to design a supervised-learning k-means method that can generate an initial model from a small number of labeled examples by using kernel density estimation, and that can efficiently adapt its structure to learn new (labeled and unlabeled) examples. For this purpose, we propose several modifications to k-means that involve changes to the clustering procedure as well as structural changes.

The proposed method is termed supervised KDE k-means (SKDEKMeans) and was tested on public datasets. The results were compared with those obtained by an online random forest model termed Mondrian forest (MF) (Lakshminarayanan et al., 2014), a semi-supervised Learn++ model (SSLearn++) (Polikar et al., 2001), an incremental extreme learning machine model (IELM) (Huang & Chen, 2007), a stochastic gradient descent model (SGD), a Gaussian naive Bayes model (GNB) with incremental updates (Chan et al., 1979), and an online (semi-)supervised growing-when-required model (O(S)SGWR) (Parisi et al., 2017). It was demonstrated that the proposed method performed better than the other approaches; furthermore, it is less complex and faster than the node-based algorithm OSSGWR. Therefore, the proposed technique based on k-means and incremental semi-supervised learning is quite promising.

The contributions of this study are the following:

  • A new supervised-learning method based on k-means is proposed.

  • Imbalanced classes are handled by generating new clusters and weighting their influence.

  • Degenerate clusters are handled by marking and removing useless clusters.

  • Incremental semi-supervised learning is incorporated into the proposed method. It includes a set of conditions to select only high-confidence unlabeled examples based on kernel density estimators and other similarity measures (a sketch of this selection idea is given after this list).
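The sketch below illustrates one way such a high-confidence selection rule could look. It is hypothetical: the log-density margin test, the `margin` value, and the reuse of the `km` and `kdes` objects from the earlier sketch are our assumptions, not the paper's published conditions.

```python
import numpy as np

def select_high_confidence(X_unlabeled, km, kdes, margin=2.0):
    """Keep an unlabeled point only if the best class KDE beats the
    runner-up by `margin` in log-density within its nearest cluster."""
    selected, pseudo_labels = [], []
    for x in X_unlabeled:
        c = int(km.predict(x.reshape(1, -1))[0])
        scores = {lbl: kde.score_samples(x.reshape(1, -1))[0]
                  for (cc, lbl), kde in kdes.items() if cc == c}
        ranked = sorted(scores.values(), reverse=True)
        # A single-class cluster is treated as trivially confident.
        if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
            selected.append(x)
            pseudo_labels.append(max(scores, key=scores.get))
    return np.array(selected), np.array(pseudo_labels)
```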

This paper is organized as follows. In Section 2, previous work related to supervised and semi-supervised clustering and incremental learning, specifically using k-means, is reviewed. In Section 3, the changes to k-means for supervised, semi-supervised, and incremental learning are explained, and the proposed method is described. The experiments are described in Section 4, and the results are presented in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Related work

Herein, we briefly review the k-means algorithm as well as concepts and approaches related to supervised learning, KDE, and incremental semi-supervised learning using k-means.

Materials and methods

The proposed method is based on the k-means cluster structure and KDE. It primarily changes the use of the k parameter: it dynamically increases and decreases the k value to fit the data in a supervised-learning context, attempting to match the size and distribution of each class and to eliminate outliers. The method differs from previous semi-supervised approaches in that it does not use pairwise constraints but similarity measures. However, it uses the concept of cluster induction
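As a hedged toy illustration of the dynamic-k behavior described in this snippet (our own simplification under assumed rules, not the published procedure): impure clusters are split per class and near-empty clusters are dropped, so k grows and shrinks with the labeled data.

```python
import numpy as np

def adjust_clusters(X, y, centroids, min_size=3):
    """Split impure clusters per class and drop near-empty ones,
    so the number of clusters k tracks the labeled data."""
    labels = np.linalg.norm(
        X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    new_centroids = []
    for c in range(len(centroids)):
        pts, cls = X[labels == c], y[labels == c]
        if len(pts) < min_size:
            continue                      # drop a degenerate cluster
        if len(np.unique(cls)) > 1:       # impure: one new centroid per class
            for lbl in np.unique(cls):
                new_centroids.append(pts[cls == lbl].mean(axis=0))
        else:
            new_centroids.append(centroids[c])
    return np.array(new_centroids)        # k has grown or shrunk
```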

Experimental setup

Herein, we present the benchmark datasets, baseline classifiers, parameter settings, and data division for the experiments.

Results

The accuracy of the methods trained using the UCI, SSL-Book, and KEEL benchmarks is shown in Table 3, which presents the results obtained using labeled examples, and Table 4, which refers to incremental training of the remaining data (unlabeled examples). The tables also contain results obtained by the proposed methods, a label propagation (LP) model, MF, a Learn++ model (LPP), IELM, SGD, GNB, and O(S)SGWR. Additional results by a k-nearest-neighbors model (kNN, with k=1) exclusively for

Conclusions

In this study, we developed a supervised-learning method using k-means aided by kernel density estimation to support an incremental semi-supervised learning method. Specifically, we proposed a supervised method and a prototype of an incremental method. For the latter, we established a tentative set of conditions for selecting high-confidence unlabeled examples for training. These conditions were coupled with KDE models obtained by the supervised method. Moreover, the proposed method has a

CRediT authorship contribution statement

Frederico Damasceno Bortoloti: Methodology, Data curation, Investigation, Writing - original draft, Software, Formal analysis, Visualization, Validation. Elias de Oliveira: Conceptualization, Methodology, Writing - review & editing. Patrick Marques Ciarelli: Conceptualization, Methodology, Writing - review & editing, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

Patrick Marques Ciarelli thanks CNPq, Brazil, for the partial funding of his research work (grant 312032/2015-3).

References (58)

  • Bar-Hillel, A., et al. (2005). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research.

  • Basu, S., et al. Semi-supervised clustering by seeding.

  • Basu, S., et al. Active semi-supervision for pairwise constrained clustering.

  • Bradley, P. S., et al. Constrained k-means clustering.

  • Chambers, J. M., et al. (1983). Graphical methods for data analysis.

  • Chan, T. F., et al. (1979). Updating formulae and a pairwise algorithm for computing sample variances.

  • Chapelle, O., et al. (2010). Semi-supervised learning.

  • Chemchem, A., et al. Incremental induction rules clustering.

  • De Carolis, B., et al. (2015). Incremental learning of daily routines as workflows in a smart home environment. ACM Transactions on Interactive Intelligent Systems.

  • Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.

  • Elwell, R., et al. (2011). Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks.

  • Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications.

  • Forgy, E. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics.

  • Fritzke, B. A growing neural gas network learns topologies.

  • Gepperth, A., et al. Incremental learning algorithms and applications.

  • Giraud-Carrier, C. (2000). A note on the utility of incremental learning. AI Communications.

  • Grira, N., et al. (2004). Unsupervised and semi-supervised clustering: A brief survey. In A review of machine learning techniques for processing multimedia content, Report of the MUSCLE European Network of Excellence (FP6).

  • Guha, S., et al. CURE: An efficient clustering algorithm for large databases.