Big Data Research

Volume 23, 15 February 2021, 100170

Flexible Subspace Clustering: A Joint Feature Selection and K-Means Clustering Framework

https://doi.org/10.1016/j.bdr.2020.100170

Abstract

Regarded as an important computing paradigm, cloud computing addresses big, distributed databases with relatively simple computation. In this paradigm, data mining is one of the most important and fundamental problems. A large amount of data is generated by sensors and other intelligent devices, and mining these big data is crucial in various applications. K-means clustering is a typical technique for grouping similar data into the same cluster and has been widely used in data mining. However, it remains challenged by data containing a large amount of noise, outliers and redundant features. In this paper, we propose a robust K-means clustering algorithm, namely flexible subspace clustering. The proposed method incorporates feature selection and K-means clustering into a unified framework, which can select the refined features and improve the clustering performance. Moreover, to enhance robustness, the l2,p-norm is embedded into the objective function. We can flexibly choose an appropriate p according to the data and thus obtain more robust performance. Experimental results on benchmark databases verify that the presented method is more robust and performs better than existing approaches.

Introduction

Cloud computing with big data is regarded as an important paradigm that handles big, distributed databases with relatively simple computation. Many interesting studies concentrate on cloud security [3], smart services [2] and mobile cloud computing [1]. However, most of these papers focus on hardware, data storage and management in clouds. Recently, the Internet of Things (IoT) has been gaining increasing attention, and many related studies have been proposed in various applications such as quality prediction [4], [5], dynamic resource discovery [6], bearing tests [7] and feature recognition [8], [9]. IoT and big data have an increasing impact on the future development of cloud computing. In IoT, the enormous amount of data generated by sensors and other intelligent devices contains valuable information, but also encompasses a large amount of noise, outliers and redundant features. Thus, data mining for these big data is crucial in various applications. As one of the most important and fundamental techniques in data mining, clustering has been studied extensively and applied in many fields, such as resource scheduling in cloud computing [10], abnormal behavior detection in the cloud [11], clinical observation [12], heterogeneous data analysis [13] and so on. Clustering is a kind of unsupervised learning that groups similar data points into the same cluster. As for the similarity, the most commonly used criterion is distance, and K-means (KM) is a typical algorithm based on this criterion.

The classical K-means assigns data points to k different clusters using the l2-norm distance. It is simple and easy to solve, but easily affected by outliers and noise [14]. To overcome this problem, one direction is to use a more robust distance measure; the lp-norm is a successful extension. Hathaway et al. [15] conclude that p=1 yields robustness and that tuning the value of p can provide better clustering results than fixing p at 1 or 2, although the resulting model can be difficult to solve. Salem et al. [16] adopt the l1-norm to evaluate the similarity between an observation and a centroid, which is shown to be efficient and suitable for noisy data and outliers. Cai et al. [17] propose a multi-view K-means clustering method based on the l2,1-norm. Liang et al. [18] propose a robust K-means using the l2,1-norm in the feature space and then extend it to the kernel space. As the above literature demonstrates, reforming the distance metric can improve the performance of the K-means algorithm; a minimal sketch of how the metric enters the algorithm is given below.
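To make the role of the metric concrete, the following minimal Python sketch shows how the assignment step of K-means changes when the l2-norm is replaced by a general lp-norm. The function names and the simple mean update are our own illustration, not the exact algorithms of [15], [16], [17], [18]:

```python
import numpy as np

def assign_clusters(X, centroids, p=2.0):
    """Assign each sample to its nearest centroid under an lp-norm distance.

    p=2 recovers classical K-means; p=1 corresponds to the more robust
    l1 variants cited above. Illustrative only: the cited methods differ
    in how the norm enters the objective and in the centroid updates.
    """
    # Pairwise lp distances of shape (n_samples, k).
    diff = X[:, None, :] - centroids[None, :, :]
    dist = np.sum(np.abs(diff) ** p, axis=2) ** (1.0 / p)
    return np.argmin(dist, axis=1)

def update_centroids(X, labels, k):
    """Mean update used by classical K-means; robust l1 variants
    typically replace the mean with the coordinate-wise median."""
    return np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
```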

However, with the development of science and technology, real-world data are growing explosively. Big data sets generated in many fields contain a large number of attributes, some of which are noisy or redundant. This poses a remarkable challenge to traditional clustering methods. For example, in face recognition applications, even a relatively small face image of 128×128 resolution yields a 16384-dimensional feature vector. This kind of high-dimensional data always contains a large amount of noise, outliers and redundant features. It is difficult to cluster directly, and sometimes leads to high computational complexity and performance degradation [19], especially for K-means and its extensions. To deal with the curse of dimensionality and to reduce the noise, outliers and redundant features, an intuitive approach is to perform dimensionality reduction on the data before clustering. Many dimension reduction methods have been studied in the past decades, such as Principal Component Analysis (PCA) [20], Linear Discriminant Analysis (LDA) [21], sparse approximation to discriminant projection learning (SADPL) [22] and Locally Linear Embedding (LLE) [23]. PCAKM is a typical method that sequentially performs PCA for dimension reduction and K-means for clustering [24]; a minimal sketch of this pipeline is given below. Yin et al. [25] apply LLE to preprocess the data before performing K-means so as to make better use of the manifold information. These sequential methods can improve computational efficiency, but the subspace obtained from the dimension reduction step may not be optimal for the clustering step, so some researchers believe that separating dimension reduction from clustering may degrade clustering performance [26].
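A minimal sketch of the sequential PCAKM pipeline in scikit-learn; the subspace dimension and cluster count used here (10 and 5) are arbitrary placeholders, not values from [24]:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# PCA first, then K-means in the reduced subspace (the "sequential" scheme).
# The two steps are optimized independently, which is exactly the weakness
# noted above: the PCA subspace need not be the best one for clustering.
pcakm = make_pipeline(PCA(n_components=10), KMeans(n_clusters=5, n_init=10))

X = np.random.randn(200, 64)   # placeholder data: 200 samples, 64 features
labels = pcakm.fit_predict(X)  # cluster labels in {0, ..., 4}
```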

Intuitively, if clustering is embedded into the dimension reduction process, the clustering performance may be improved. This kind of method tries to find the optimal structure of the data in the low-dimensional feature space for clustering by performing K-means and subspace learning simultaneously. For example, Ding et al. [28] construct an adaptive framework, LDAKM, in which LDA and K-means are implemented jointly: labels are generated by the K-means algorithm, and the obtained labels are used by LDA to learn the subspace (a sketch of this alternation is given below). Since LDA may fail when the number of samples is very small, several extensions of LDA have been used to replace it, obtaining better results than LDAKM [29], such as the Maximum Margin Criterion (MMC) [30], the Orthogonal Centroid Method (OCM) [31] and Orthogonal Least Squares Discriminant Analysis (OLSDA) [32]. Hou et al. [29] consider the relation between PCA and K-means and propose a general subspace clustering framework. Such algorithms have been shown to obtain better results than the sequential ones, but they also have drawbacks. They all need to compute an approximate solution by eigenvalue decomposition, which increases the computational burden, so they may fail when facing high-dimensional data. Moreover, since the optimal subspace is found by an orthogonal linear transformation, the obtained low-dimensional features may be difficult to interpret. Wang et al. [27] construct a special feature selection matrix and propose a fast adaptive subspace clustering algorithm, FAKM, based on DEC, which can effectively select the most representative subspace without requiring eigenvalue decomposition. FAKM also applies adaptive learning to the K-means part.
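The following Python sketch illustrates the LDAKM-style alternation described above using scikit-learn components; the initialization, iteration count and stopping rule are our own simplifications and only loosely follow [28]:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def ldakm_sketch(X, k, dim, n_iter=10, seed=0):
    """Alternate between K-means (to generate pseudo-labels) and LDA
    (to re-learn the subspace from those labels)."""
    rng = np.random.default_rng(seed)
    # Start from a random linear projection of the (centered) data.
    W = rng.standard_normal((X.shape[1], dim))
    Z = X @ W
    labels = None
    for _ in range(n_iter):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
        # LDA allows at most k-1 discriminant directions.
        lda = LinearDiscriminantAnalysis(n_components=min(dim, k - 1))
        Z = lda.fit_transform(X, labels)
    return labels, Z
```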

Most of the methods mentioned above are based on the l2-norm distance metric, which is known to be very sensitive to outliers and noise. It is therefore meaningful to build a model with a robust distance metric. Recently, the l2,p-norm has been successfully used to replace the l2-norm as a distance metric to improve robustness, as in DCM [33] and l2,p-PCA [34]. In l2,p-PCA, the l2,p-norm is incorporated into PCA, making it robust to outliers while retaining the desirable properties of big data. Inspired by FAKM and l2,p-PCA, we propose a flexible subspace clustering method. Our method flexibly chooses an appropriate p according to the data and thus obtains more robust clustering performance. Experimental results on various datasets demonstrate the effectiveness of the proposed algorithm.
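For reference, the matrix l2,p-norm underlying these robust formulations is defined below. The generic l2,p K-means-type residual on the second line is only meant to show where the norm enters and why a smaller p helps; it is not the paper's exact objective, which also involves the feature selection matrix of Section 2:

```latex
% Row-wise l_{2,p}-norm of a matrix M with rows m^i (0 < p <= 2):
\|M\|_{2,p} = \Bigl( \sum_{i=1}^{n} \lVert m^{i} \rVert_2^{\,p} \Bigr)^{1/p}

% Generic l_{2,p} K-means-type objective: centroids f_j, assignment g(i).
% Taking p < 2 down-weights large residuals, which is the source of robustness.
\min_{F,\,g} \; \sum_{i=1}^{n} \lVert x_i - f_{g(i)} \rVert_2^{\,p}
```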

The main contributions of our paper are listed as follows.

  • The proposed algorithm jointly combines feature selection and clustering into a single framework.

  • The use of the l2,p-norm in K-means makes our algorithm robust to the noise and redundant features of big data.

  • The proposed objective is neither convex nor Lipschitz continuous, and is thus difficult to solve directly. We propose an iterative algorithm to optimize it; a common reweighting device for such objectives is sketched after this list.
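Objectives built on the l2,p-norm with p < 2 are commonly handled by iteratively reweighted least squares (IRLS), solving a sequence of weighted l2 problems. The sketch below shows only the standard weight computation, as an assumption about the general solution strategy rather than the paper's actual update rules:

```python
import numpy as np

def l2p_irls_weights(residuals, p, eps=1e-8):
    """IRLS weights for minimizing sum_i ||r_i||_2^p with 0 < p < 2.

    Each row r_i of `residuals` gets weight w_i = (p/2) * ||r_i||_2^(p-2),
    so the next weighted least-squares step down-weights large residuals.
    `eps` guards against division by zero for near-zero rows.
    """
    norms = np.linalg.norm(residuals, axis=1) + eps
    return 0.5 * p * norms ** (p - 2.0)
```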

The rest of the paper is organized as follows. In Section 2, we propose our model and derive an efficient algorithm to optimize it. In Section 3, the proposed model is evaluated on benchmark databases. Finally, we draw conclusions in Section 4.


The proposed method

In this section, we introduce the details of the proposed clustering method. The main content is organized into the following parts: the formulation of the proposed approach, an efficient optimization algorithm, and analyses of convergence and computational complexity.

Data description

We conduct analytical experiments on seven datasets to evaluate the performance. For each dataset, we preprocess all values by centralization (a minimal sketch of this step is given after the dataset list). These datasets include:

UCI datasets: We evaluate our algorithm on four datasets: Cars, Wine, Ionosphere, and Ecoli.

USPS Digit Dataset: The dataset includes 9298 handwritten digit images, all of which are grayscale images of 16×16 pixels.
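As noted above, every dataset is centralized before clustering. A minimal sketch of that preprocessing step, under our reading of "centralization" as per-feature mean subtraction:

```python
import numpy as np

def centralize(X):
    """Center each feature (column) of X to zero mean."""
    return X - X.mean(axis=0, keepdims=True)
```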

Conclusion

In this paper, we propose a flexible subspace clustering model. Specifically, we first incorporate feature selection and K-means clustering into a single framework, which can select the refined features and improve the clustering performance. Second, we embed the l2,p-norm into the framework to enhance robustness and retain the desirable properties of big data. Finally, considering that the proposed model is neither convex nor Lipschitz continuous, we develop an effective algorithm to solve it.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (42)

  • L. Ren et al., A wide-deep-sequence model based quality prediction method in industrial process analysis, IEEE Trans. Neural Netw. Learn. Syst. (2020)

  • Z. Li et al., Resource discovery based on preference and movement pattern similarity for large-scale social internet of things, IEEE Int. Things J. (2016)

  • X. Wang et al., ADTT: a highly-efficient distributed tensor-train decomposition method for IIoT big data, IEEE Trans. Ind. Inform. (2020)

  • G. Li et al., Dynamic gesture recognition in the internet of things, IEEE Access (2019)

  • X. Wang et al., A tensor-based multi-attributes visual feature recognition method for industrial intelligence, IEEE Trans. Ind. Inform. (2020)

  • X. Wang et al., The research on resource scheduling based on fuzzy clustering in cloud computing

  • X. Zhang et al., A robust clustering-based abnormal behavior detection system for large-scale cloud

  • V.J. Hodge et al., A survey of outlier detection methodologies, Artif. Intell. Rev. (2004)

  • R.J. Hathaway et al., Generalized fuzzy c-means clustering strategies using Lp norm distances, IEEE Trans. Fuzzy Syst. (2000)

  • S.B. Salem et al., A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach, Comput. Electr. Eng. (2018)

  • X. Cai et al., Multi-view K-means clustering on big data
This work was supported in part by National Natural Science Foundation of China under Grant 62006056 and Grant 61802148, in part by National Statistical Science Research Project of China under Grant 2020LY090, and in part by the Natural Science Foundation of Guangdong Province under Grant 2019A1515011266.
