Flexible Subspace Clustering: A Joint Feature Selection and K-Means Clustering Framework☆
Introduction
Cloud computing with big data is regarded as an important paradigm for handling large, distributed databases with relatively simple computation. Many interesting studies concentrate on cloud security [3], smart services [2] and mobile cloud computing [1]. However, most of these papers focus on hardware, data storage and management in clouds. Recently, the Internet of Things (IoT) has been gaining increasing attention, and many related studies have been proposed for various applications such as quality prediction [4], [5], dynamic resource discovery [6], bearing tests [7] and feature recognition [8], [9]. IoT and big data have a growing impact on the future development of cloud computing. In IoT, the enormous amount of data generated by sensors and other intelligent devices contains valuable information, but also a large amount of noise, outliers and redundant features. Thus, mining such big data is crucial for making it usable in various applications. As one of the most important and fundamental techniques in data mining, clustering has been studied extensively and applied in many fields, such as resource scheduling in cloud computing [10], abnormal behavior detection in clouds [11], clinical observation [12], heterogeneous data analysis [13] and so on. Clustering is a kind of unsupervised learning that groups similar data points into the same cluster. The most commonly used similarity criterion is distance, and K-means (KM) is a typical algorithm based on this criterion.
The classical K-means assigns data points to k different clusters using the ℓ2-norm distance. It is simple and easy to solve, but easily affected by outliers and noise [14]. To overcome this problem, one direction is to use a more robust distance measure. The ℓp-norm is a successful extension: Hathaway et al. [15] conclude that the ℓp-norm exhibits robustness and that choosing the value of p can provide better clustering results than fixing p as 1 or 2, although the resulting model can be difficult to solve. Salem et al. [16] adopt such a norm to evaluate the similarity between an observation and a centroid, and show it to be efficient and well suited to noisy data and outliers. Cai et al. [17] propose a multi-view K-means clustering based on the ℓ2,1-norm. Liang et al. [18] propose a robust K-means with a robust norm in the feature space and then extend it to the kernel space. Reforming the distance metric can thus improve the performance of the K-means algorithm, as the literature above demonstrates.
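To make the idea concrete, here is a minimal numpy sketch of K-means with a generic ℓp assignment step. The function name `lp_kmeans` is ours, not from any of the cited papers; and since the ℓp-optimal center has no closed form in general, the sketch keeps the ordinary mean update (exact only for p = 2):

```python
import numpy as np

def lp_kmeans(X, k, p=1.5, n_iter=50, seed=0):
    """K-means-style clustering with an l_p assignment step (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # initialize centers as k distinct random data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the nearest center under sum_j |x_j - c_j|^p
        dists = (np.abs(X[:, None, :] - centers[None, :, :]) ** p).sum(axis=2)
        labels = dists.argmin(axis=1)
        # simplified center update: coordinate-wise mean (exact for p = 2)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Varying p between 1 and 2 trades off between the outlier-sensitivity of the squared distance and the robustness of the absolute deviation, which is the motivation behind the ℓp extensions discussed above.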
However, with the development of science and technology, real-world data are growing explosively. Big data sets generated in many fields contain a large number of attributes, some of which are noisy or redundant. This poses a remarkable challenge to traditional clustering methods. For example, in face recognition applications, even a face image of the relatively small resolution 128 × 128 generates a 16384-dimensional feature vector. Such high-dimensional data always contain a large amount of noise, outliers and redundant features. They are difficult to cluster directly, and clustering them sometimes incurs high computational complexity and performance degradation [19], especially for K-means and its extensions. To deal with the curse of dimensionality and reduce noise, outliers and redundant features, an intuitive approach is to perform dimensionality reduction on the data before clustering. Many dimension reduction methods have been studied in the past decades, such as Principal Component Analysis (PCA) [20], Linear Discriminant Analysis (LDA) [21], sparse approximation to discriminant projection learning (SADPL) [22] and Locally Linear Embedding (LLE) [23]. PCAKM is a typical method that sequentially conducts PCA for dimension reduction and then K-means for clustering [24]. Yin et al. [25] apply LLE to preprocess the data before performing K-means in order to better exploit the manifold information. These sequential methods can improve computational efficiency, but the subspace obtained from the dimension reduction step may not be optimal for the clustering step, so some researchers believe that separating dimension reduction from clustering may degrade clustering performance [26].
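A minimal sketch of such a sequential pipeline in the style of PCAKM, assuming scikit-learn is available (the function name `pca_kmeans` is ours; the cited method may differ in details):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_kmeans(X, n_components, k, seed=0):
    """Sequential pipeline: reduce dimensionality with PCA,
    then cluster the projected data with K-means."""
    Z = PCA(n_components=n_components).fit_transform(X)   # learned subspace
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return labels, Z
```

Note that the PCA step here is computed without any knowledge of the clustering objective, which is exactly the decoupling criticized in [26].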
Intuitively, if clustering is embedded into the dimension reduction process, clustering performance may improve. Methods of this kind try to find the optimal structure of the data in a low-dimensional feature space for clustering by performing K-means and subspace learning simultaneously. For example, Ding et al. [28] construct an adaptive framework, LDAKM, in which LDA and K-means are implemented jointly: labels are generated by the K-means algorithm, and the obtained labels are then used by LDA to learn the subspace. Since LDA may fail when the number of samples is very small, several extensions of LDA have been used to replace it, achieving better results than LDAKM [29]; examples include the Maximum Margin Criterion (MMC) [30], the Orthogonal Centroid Method (OCM) [31] and Orthogonal Least Squares Discriminant Analysis (OLSDA) [32]. Hou et al. [29] consider the relation between PCA and K-means and propose a general subspace clustering framework. Algorithms of this kind have been shown to obtain better results than the sequential ones, but they also have drawbacks. They all need to compute an approximate solution by eigenvalue decomposition, which increases the computational burden, so they may fail on high-dimensional data. Moreover, since the optimal subspace is found by an orthogonal linear transformation, the meaning of the obtained low-dimensional features can be difficult to interpret. Wang et al. [27] construct a special feature selection matrix and propose a fast adaptive subspace clustering algorithm, FAKM, based on DEC, which can effectively select the most representative subspace without requiring eigenvalue decomposition. FAKM also applies adaptive learning to the K-means part.
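The alternation in LDAKM can be sketched as follows, assuming scikit-learn; this is an illustrative loop in the spirit of the method, not the authors' exact algorithm (the function name `lda_kmeans` is ours):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_kmeans(X, k, n_rounds=5, seed=0):
    """Alternating scheme in the spirit of LDAKM: K-means supplies
    pseudo-labels, LDA uses them to learn a discriminative subspace,
    and K-means re-clusters in that subspace."""
    # initial pseudo-labels from K-means in the original space
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    Z = X
    for _ in range(n_rounds):
        # learn a (k-1)-dimensional subspace from the current pseudo-labels
        lda = LinearDiscriminantAnalysis(n_components=k - 1)
        Z = lda.fit_transform(X, labels)
        # re-cluster in the learned subspace
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return labels, Z
```

The eigenvalue decomposition hidden inside each LDA fit is precisely the per-iteration cost that FAKM [27] is designed to avoid.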
Most of the methods mentioned above are based on the ℓ2-norm distance metric, which is known to be very sensitive to outliers and noise. Therefore, it is meaningful to build a model with a robust distance metric. Recently, the ℓ2,p-norm has been successfully used to replace the ℓ2-norm as a distance metric to improve robustness, for example in DCM [33] and ℓ2,p-PCA [34]. In ℓ2,p-PCA, the ℓ2,p-norm is incorporated into PCA, making it robust to outliers while retaining the desirable properties of big data. Inspired by FAKM and ℓ2,p-PCA, we propose a flexible subspace clustering method. Our method flexibly chooses an appropriate p according to the data and thus achieves more robust clustering performance. Experimental results on various datasets demonstrate the effectiveness of the proposed algorithm.
The main contributions of our paper are listed as follows.
- The proposed algorithm jointly combines feature selection and clustering into a single framework.
- The use of the ℓ2,p-norm in K-means makes our algorithm robust to the noise and redundant features of big data.
- The proposed objective is neither convex nor Lipschitz continuous, and thus difficult to solve directly. We propose an iterative algorithm to optimize it.
The rest of the paper is organized as follows. We propose our model and derive an efficient algorithm to optimize the model in Section 2. In Section 3, the proposed model is evaluated on the benchmark databases. Finally, we draw the conclusion in Section 4.
The proposed method
In this section, we introduce the details about the proposed method for clustering. The main content will be separated into the following several parts including the formulation of the proposed approach, an efficient algorithm, convergence and computational complexity analysis.
Data description
We conduct analytical experiments on seven datasets to evaluate the performance. For each dataset, we preprocess all the values by centralization. These datasets include:
UCI datasets: We evaluate our algorithm on four datasets: Cars, Wine, Ionosphere, and Ecoli.
USPS Digit Dataset: The dataset includes 9298 handwritten digit images, all of which are 16 × 16 grayscale images.
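The centralization preprocessing mentioned above amounts to subtracting the per-feature mean; a minimal sketch (the function name `centralize` is ours):

```python
import numpy as np

def centralize(X):
    """Center each feature (column) of the data matrix to zero mean,
    as in the preprocessing applied to every dataset."""
    return X - X.mean(axis=0, keepdims=True)
```

After this step, every attribute of each dataset has zero mean, so no single feature dominates the distance computations purely because of its offset.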
Conclusion
In this paper, we propose a flexible subspace clustering model. Specifically, we first incorporate feature selection and K-means clustering into a single framework, which can select refined features and improve clustering performance. Second, we embed the ℓ2,p-norm into the framework to enhance robustness and retain the desirable properties of big data. Finally, considering that the proposed model is neither convex nor Lipschitz continuous, we develop an effective algorithm to solve it.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (42)
- et al., Towards a comprehensive data analytics framework for smart healthcare services, Big Data Res. (2016)
- et al., Kluster: an efficient scalable procedure for approximating the number of clusters in unsupervised learning, Big Data Res. (2018)
- et al., High-order possibilistic c-means algorithms based on tensor decompositions for big data in IoT, Inf. Fusion (2018)
- et al., Learning a subspace for clustering via pattern shrinking, Inf. Process. Manag. (2013)
- et al., Regularized soft K-means for discriminant analysis, Neurocomputing (2013)
- et al., Unsupervised feature analysis with sparse adaptive learning, Pattern Recognit. Lett. (2018)
- et al., Orthogonal vs. uncorrelated least squares discriminant analysis for feature extraction, Pattern Recognit. Lett. (2012)
- et al., Using crowdsourcing to provide QoS for mobile cloud computing, IEEE Trans. Cloud Comput. (2019)
- et al., A semantic approach to cloud security and compliance
- et al., A data-driven approach of product quality prediction for complex production systems, IEEE Trans. Ind. Inform. (2020)
- A wide-deep-sequence model based quality prediction method in industrial process analysis, IEEE Trans. Neural Netw. Learn. Syst.
- Resource discovery based on preference and movement pattern similarity for large-scale social internet of things, IEEE Int. Things J.
- ADTT: a highly-efficient distributed tensor-train decomposition method for IIoT big data, IEEE Trans. Ind. Inform.
- Dynamic gesture recognition in the internet of things, IEEE Access
- A tensor-based multi-attributes visual feature recognition method for industrial intelligence, IEEE Trans. Ind. Inform.
- The research on resource scheduling based on fuzzy clustering in cloud computing
- A robust clustering-based abnormal behavior detection system for large-scale cloud
- A survey of outlier detection methodologies, Artif. Intell. Rev.
- Generalized fuzzy c-means clustering strategies using Lp norm distances, IEEE Trans. Fuzzy Syst.
- A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach, Comput. Electr. Eng.
- Multi-view K-means clustering on big data
☆ This work was supported in part by the National Natural Science Foundation of China under Grants 62006056 and 61802148, in part by the National Statistical Science Research Project of China under Grant 2020LY090, and in part by the Natural Science Foundation of Guangdong Province under Grant 2019A1515011266.