Abstract

Clustering is an important ingredient of unsupervised learning; classical clustering methods include K-means clustering and hierarchical clustering. These methods may suffer from instability because they tend to become trapped in local optima of nonconvex optimization models. In this paper, we propose a new convex clustering method for high-dimensional data based on the sparse group lasso penalty, which can simultaneously group observations and eliminate noninformative features. In this method, the number of clusters can be learned from the data instead of being given in advance as a parameter. We theoretically prove that the proposed method has desirable statistical properties, including a finite sample error bound and feature screening consistency. Furthermore, the semiproximal alternating direction method of multipliers is designed to solve the sparse group lasso convex clustering model, and its convergence analysis is established without any additional conditions. Finally, the effectiveness of the proposed method is thoroughly demonstrated through simulated experiments and real applications.

1. Introduction

Clustering is an important ingredient of unsupervised learning. It assigns samples to different clusters by minimizing the differences within the same cluster and maximizing the differences between different clusters. As an exploratory data analysis technique, it has been applied in many fields, such as image processing, energy engineering, and social networks [1–3]. To date, a wide variety of clustering methods have been introduced, including classical methods such as K-means clustering [4, 5] and hierarchical clustering [6, 7]. However, the convexity of the corresponding optimization models cannot be guaranteed in general, so their global optimal solutions are hard to find. In addition, the performance of these methods depends strongly on their initial settings. To overcome these disadvantages, convex clustering (CC), which provides a convex optimization perspective on the clustering task via shrinkage or regularization techniques, has been considered by many researchers, e.g., [8–11]. Its corresponding optimization model is given by

where is a given data matrix with observations and features; denotes the Frobenius norm of the matrix; is a tuning parameter that controls the balance between the model fit and the number of clusters; is the index set; are given weights that are generally chosen based on the given data matrix ; is the th row (column) of matrix ; and the most common choices of are , and , which ensure the convexity of model (1). Let be an optimal solution of (1); then, indicates that the observations and are assigned to the same cluster. Note from (1) that setting leads to the trivial solution , i.e., the partition of the observations into singletons. As increases, some rows of become identical. If is set sufficiently large, all rows of become identical, which indicates that all observations are lumped into a single cluster. Hence, the clustering path can be obtained by solving (1) over a range of tuning parameters, and the number of clusters is then determined by learning from the data rather than being given in advance. This fascinating characteristic of convex clustering has attracted increasing attention; see, e.g., [12–21].
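For reference, the weighted convex clustering objective described here is commonly written in the following form (cf. [10, 14]); the symbols X, A, gamma, w_{ij}, and q below follow that literature and are not necessarily this paper's exact notation in (1).

```latex
% Weighted convex clustering: X is the n x p data matrix, A the matrix of
% cluster representatives (one row per observation), A_{i.} its i-th row,
% w_{ij} >= 0 pairwise weights, gamma >= 0 the fusion tuning parameter,
% and q in {1, 2, infinity} the norm used in the fusion penalty.
\min_{A \in \mathbb{R}^{n \times p}} \;
  \frac{1}{2}\,\lVert X - A \rVert_F^{2}
  \;+\; \gamma \sum_{1 \le i < j \le n} w_{ij}\,
        \lVert A_{i\cdot} - A_{j\cdot} \rVert_{q}
```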

The theoretical results on cluster recovery of (1) with were established in [12, 13]. Later, Sun et al. [14] extended these results to the general weighted convex clustering model. Furthermore, in [15], the authors analyzed the statistical properties of (1) with , such as an unbiased estimator of the degrees of freedom and a finite sample bound for the prediction error. The large sample behavior of (1) with fusion penalization is presented in [16]. In [17], the authors present conditions that guarantee that the convex clustering solution path recovers a tree and explicitly describe how the affinity parameters modulate the structure of the recovered tree. Meanwhile, some researchers have concentrated on solution algorithms for (1). For example, Chi and Lange [10] adopted the alternating direction method of multipliers (ADMM) and the alternating minimization algorithm (AMA) to solve (1). More recently, in [14, 18], a semismooth Newton-based augmented Lagrangian method was applied to (1) and outperformed ADMM and AMA in efficiency, scalability, and robustness. Moreover, the idea of convex clustering has been applied to missing data [19], binary data [20], and outlier detection [21].

Although convex clustering (1) has desirable theoretical properties and performs well computationally, Wang et al. [22] pointed out that its clustering performance may deteriorate in high-dimensional scenarios where the number of features is much larger than the number of observations. The noninformative features included in the clustering are the key reason why convex clustering (1) performs poorly in this setting. Hence, eliminating noninformative features is indispensable for high-dimensional data clustering. This motivates us to propose a more reasonable convex clustering method that can perform cluster analysis and feature screening simultaneously.

In this paper, we propose the Sparse Group Lasso Convex Clustering (SGLCC) method by adopting the sparse group lasso penalty [23]. The optimization problem is summarized as

where the tuning parameter controls the number of informative features and is the weight vector. The form of the penalty term is much like the elastic-net penalty [24], and and denote the Euclidean norm and the norm of vector , respectively. The parameter controls the balance between the group lasso [25, 26] and the lasso [27]. Obviously, convex clustering (1) is a special case of (2) if we set . Unlike (1), model (2) can perform variable selection. When , model (2) reduces to sparse convex clustering (SCC) [22]. SCC uses the group lasso penalty to promote group sparsity, that is, within each group all components of the solution are either simultaneously zero or simultaneously nonzero. However, it cannot induce the within-group sparsity that is ubiquitous in genetic data. For example, a biological pathway may be implicated in the progression of a particular type of cancer, but not all genes in the pathway need to be active [28]. The sparse group lasso penalty is designed to achieve such within-group sparsity. Therefore, we believe that the proposed clustering model (2) can significantly improve the clustering performance of model (1). It is worth noting that solving the optimization problem (2) is more challenging than solving convex clustering (1) or SCC. If the optimization algorithms utilized in [10, 22] are applied to (2) directly, the computational efficiency may be poor because the subproblems might have no closed-form solution and the step length may be too small. Hence, in this paper we develop the theory, algorithm, and applications of SGLCC.
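To make the structure of (2) concrete, one natural way to write a convex clustering objective with a sparse group lasso penalty on the feature columns is sketched below; the names gamma_1, gamma_2, u_j, and alpha are assumptions introduced for illustration (with gamma_2 = 0 recovering (1) and alpha = 1 recovering SCC), and the exact form and scaling used in the paper's display (2) may differ.

```latex
% A sketch of a sparse group lasso convex clustering objective: the convex
% clustering term of (1) plus, for each feature column A_{.j}, a weighted
% mixture of a group lasso (Euclidean norm) and a lasso (l1 norm) penalty.
% alpha in [0, 1] balances the two; u_j are feature weights.
\min_{A \in \mathbb{R}^{n \times p}} \;
  \frac{1}{2}\,\lVert X - A \rVert_F^{2}
  + \gamma_1 \sum_{1 \le i < j \le n} w_{ij}\, \lVert A_{i\cdot} - A_{j\cdot} \rVert_{q}
  + \gamma_2 \sum_{j=1}^{p} u_j \Bigl( \alpha\, \lVert A_{\cdot j} \rVert_{2}
        + (1 - \alpha)\, \lVert A_{\cdot j} \rVert_{1} \Bigr)
```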

The main contributions of the paper are summarized as follows:
(1) For the proposed SGLCC, we obtain a finite sample error bound and feature screening consistency under mild conditions. Our results are not only applicable to the more practical SGLCC model but also subsume the known results for (1) and SCC as special cases.
(2) We adopt the semiproximal alternating direction method of multipliers to solve (2), and its convergence analysis is established without any additional conditions. Each subproblem of this algorithm has a closed-form solution and can be implemented efficiently.
(3) The experimental results on both synthetic and real datasets illustrate that SGLCC provides clustering performance and feature selection abilities superior to those of other clustering methods.

The remainder of this paper is organized as follows. In Section 2, we summarize some preliminaries and notations. We study the finite sample error bound and feature screening consistency of our proposed method in Section 3. In Section 4, we introduce an efficient optimization algorithm for solving (2). After that, in Section 5, we conduct numerical experiments to verify the effectiveness of the proposed method in synthetic datasets and four microarray datasets. Conclusions are given in Section 6. The proof of the main results can be found in the Supplementary Materials (available here).

2. Preliminaries

In this section, we introduce some preliminaries that are used later in the paper.

For convenience, we start with some necessary notations. For a vector and , denotes the norm of and denotes a vector with all components equal to one. Let . For a matrix , is the th row (the th column) of , denotes the Frobenius norm of , denotes the spectral norm of , and the vectorization of is defined by . The symbols and denote the Hadamard product and Kronecker product, respectively. The linear operator is given by

The th row of is denoted as , and is a vector with the th component equal to one and the rest equal to zero. The adjoint operator of is given by , that is, .

Next, some definitions and results from convex analysis [29] are provided. Proximal mapping is important for designing optimization algorithms and has been well studied. For a proper closed convex function , the proximal mapping for at any with is defined by

In particular, if is an indicator function of the set ( if ; otherwise ); then, the proximal mapping reduces to the projection operator onto . The Fenchel conjugate of is defined by

A key property that connects the proximal mapping and Fenchel conjugate is the so-called Moreau decomposition:

For example, if with , then from the Moreau decomposition we obtain

where and .
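The definitions and identities referred to above are standard (see [29]); written out with a scaling parameter t > 0, they read as follows, where the last line specializes the Moreau decomposition to the Euclidean norm as in the example.

```latex
% Standard definitions from convex analysis, written with a scaling t > 0:
\operatorname{prox}_{tf}(x) \;=\; \arg\min_{u}\Bigl\{ f(u) + \tfrac{1}{2t}\,\lVert u - x\rVert_{2}^{2} \Bigr\},
\qquad
f^{*}(y) \;=\; \sup_{x}\,\bigl\{ \langle x, y\rangle - f(x) \bigr\}.

% Moreau decomposition:
x \;=\; \operatorname{prox}_{tf}(x) \;+\; t\,\operatorname{prox}_{t^{-1}f^{*}}(x/t).

% Specialization to f = \lVert \cdot \rVert_2: f^* is the indicator of the unit
% Euclidean ball B, so the prox is a shrinkage obtained from a projection:
\operatorname{prox}_{t\lVert\cdot\rVert_{2}}(x)
  \;=\; x - \Pi_{tB}(x)
  \;=\; \Bigl(1 - \tfrac{t}{\lVert x\rVert_{2}}\Bigr)_{+}\, x .
```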

3. Statistical Properties

In this section, we study the finite sample error bound and feature screening consistency of sparse group lasso convex clustering (2). We start with the following conditions, which facilitate the technical proofs.
A1: assume that , where is the mean vector and is an error vector of independent sub-Gaussian random variables with mean zero and variance .
A2: only features are informative; the informative feature set and its complementary set are then denoted as and , respectively.

The condition A1 implies that the data matrix is composed of sub-Gaussian random variables and is commonly used in the literature, see, e.g., [15, 22]. In high-dimensional scenarios, only a few features are active in general. Thus, condition A2 has been widely used in high-dimensional data analysis, see, e.g., [22, 30]. For simplicity, as described in the literature [12, 13, 15, 22], we consider the case with .

3.1. Bounds for Prediction Error

The following theorems provide the finite sample bounds for prediction error of model (2), and our theoretical results subsume the existing results for CC and SCC as special cases.

Theorem 1. Suppose that condition A1 is satisfied. Let be the solution of model (2) with . If , then there exist positive constants and such that

Here,

From Theorem 1, we can observe that the average prediction error is bounded and tends to zero as . We know that and by combining condition A2 with . Model (2) with has prediction consistency only if , , , and . To be specific, we find that and tend to zero as by and . If , then

and decays to zero as . Additionally, by including , the condition implies that .

The next theorem presents the finite sample bound for the prediction error of model (2) with .

Theorem 2. Suppose that condition A1 is satisfied. Let be the solution of model (2) with . If , then there exist positive constants and such that

We only discuss the first term on the right-hand side of (11) in Theorem 2 because all other terms are the same as (8) in Theorem 1. We find that by condition A2. Note that , and thus . Suppose that , we obtain that by combining .

Remark 1.
(1) In high-dimensional scenarios, the number of features is considerably larger than the number of observations . If , then conditions and hold automatically, and this condition can be found in [15].
(2) This intuitive illustration implies that the condition requires that the weight not be too large.
(3) Theorems 1 and 2 subsume the known results for CC and SCC by taking and , respectively. In addition, we obtain the encouraging results provided in Corollary 1 when .
The following corollary follows directly from Theorems 1 and 2 by letting .

Corollary 1. Suppose that conditions A1-A2 are satisfied. Let be the solution of model (2) with . We obtain the following statements.
(1) When , if , , and , then, as ,
(2) When , if , , and , then, as ,
As we can see from Corollary 1, SGLCC with has prediction consistency and requires only conditions that are easy to satisfy.

3.2. Feature Screening Consistency

Feature screening consistency is a desirable property in high-dimensional scenarios where many noninformative features need to be eliminated. This section demonstrates the asymptotic feature screening consistency of SGLCC.

Theorem 3. Let be the solution of model (2) with . Suppose that conditions A1-A2 hold. If , , , and , then , for any .

Theorem 4. Let be the solution of model (2) with . Suppose that conditions A1-A2 hold. If , , , and , then , for any .

Theorems 3 and 4 give the asymptotic consistency of feature selection of SGLCC with and , respectively. In this sense, SGLCC possesses the ability to correctly eliminate the noninformative features that distort clustering performance. Hence, we believe that the clustering performance of SGLCC is promising.

4. Algorithmic Design

In this section, we adopt the semiproximal alternating direction method of multipliers to solve (2), and its global convergence can be established. For simplicity, we focus on designing an algorithm to solve (2) with . The same algorithmic design and implementation can be applied to cases and without difficulty.

4.1. Optimality Conditions

By ignoring the terms with , model (2) can be rewritten as

where . For convenience of notation, we define the linear operator and its adjoint operator , respectively, by

where . Furthermore, model (13) can be reformulated as the following linearly constrained convex optimization problem with a separable structure:

where , , and . The Lagrangian function associated with (16) is defined by

where , , and are Lagrangian multipliers. For a given parameter , the augmented Lagrangian function for (16) is defined by

The Karush–Kuhn–Tucker (KKT) conditions for (16) are given by

Let be a feasible set of (16). Obviously, and . Then, we know from Corollaries 28.2.2 and 28.3.1 in [29] that is an optimal solution for (16) if and only if there exists a Lagrangian multiplier such that satisfies the KKT conditions (19). Hence, we could design a stopping criterion based on the KKT conditions in terms of guaranteeing the optimality of a point generated by the algorithm proposed in Section 4.2.

4.2. Algorithm and Convergence Analysis

Although the optimization problem (16) is a convex optimization problem with linear constraints, it is difficult to minimize over all four variables simultaneously. Hence, we divide the four variables into two groups and update the two groups alternately. According to this separable structure, we employ the semiproximal alternating direction method of multipliers (sPADMM) to solve (16), viewing as one block and as another.

Next, we discuss the iteration schemes of sPADMM for solving the optimization problem (16). Given the th iteration and two self-adjoint positive semidefinite operators and , first update variables and as follows:

Then, update Lagrangian multipliers:

Here, is the step length and . It is well known that the sPADMM reduces to the classical ADMM introduced by [31, 32] when and . A comprehensive review of sPADMM can be found in Appendix B of [33].
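For orientation, the generic two-block sPADMM template for a problem of the form min f(x) + g(y) subject to Ax + By = c reads as follows; the symbols here are generic placeholders rather than this paper's block variables.

```latex
% Generic sPADMM iteration with augmented Lagrangian L_sigma, self-adjoint
% positive semidefinite operators S and T, and step length tau; taking
% S = T = 0 and tau = 1 recovers the classical ADMM, and convergence is
% typically guaranteed for tau in (0, (1 + sqrt(5))/2).
x^{k+1} = \arg\min_{x}\; L_{\sigma}(x, y^{k}; z^{k})
          + \tfrac{1}{2}\,\lVert x - x^{k}\rVert_{S}^{2},
\qquad
y^{k+1} = \arg\min_{y}\; L_{\sigma}(x^{k+1}, y; z^{k})
          + \tfrac{1}{2}\,\lVert y - y^{k}\rVert_{T}^{2},
\qquad
z^{k+1} = z^{k} + \tau\sigma\,\bigl(Ax^{k+1} + By^{k+1} - c\bigr).
```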

Remark 2. The choice of the two self-adjoint positive semidefinite linear operators and depends greatly on the problem. The general principle is that both and should be as small as possible, while the corresponding subproblems are still relatively easy to solve.
Each block of (20) involves solving a convex optimization problem. Next, we show the closed-form solutions of each block. First, we deal with the first block in (20):

The -subproblem is a smooth, strongly convex optimization problem that has a unique global minimizer . Finding the minimizer is equivalent to finding the solution of the following linear system:

where . To solve this linear system efficiently, we design an appropriate . In our experiments, let , which is a positive semidefinite matrix. Then, , and

by the Sherman–Morrison–Woodbury formula [34]. Hence, the closed-form solution of this linear system can be obtained as

For the second block in (20), we let the linear operator , since the linear operator of in (16) is an identity operator, which ensures that the -subproblem is well defined:

which can be divided into three parts:

Together with the definition of the proximal mapping and the Moreau decomposition, we show the closed-form solutions of the above subproblems. For any , the -subproblem has a closed-form solution that is given by

Similarly, for any , the solution of the -subproblem is given by

The last subproblem is a soft-thresholding operator applied to each column, that is, for any ,

Based on the above analysis, the algorithm framework of sPADMM for solving (16) is summarized in Algorithm 1.
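As a concrete illustration of the two kinds of closed forms appearing in the second block, the following minimal sketch implements columnwise group soft-thresholding (for the group lasso part) and elementwise soft-thresholding (for the lasso part); the variable names R, u, lam_group, and lam_l1 are placeholders, and the actual subproblems in (28)-(30) also involve the penalty parameter and the specific arguments built from the augmented Lagrangian, which are omitted here.

```python
import numpy as np

def soft_threshold(M, tau):
    """Elementwise soft-thresholding: the proximal mapping of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def group_soft_threshold_columns(M, tau):
    """Columnwise group soft-thresholding: the proximal mapping of the
    Euclidean norm applied to each column (tau may be a scalar or a
    per-column vector of thresholds)."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

# Hypothetical usage inside one iteration (all names are placeholders): given
# a current matrix R built from the augmented Lagrangian terms and feature
# weights u, the group lasso and lasso parts reduce to the two prox maps above.
rng = np.random.default_rng(0)
R = rng.standard_normal((20, 50))                      # n = 20 observations, p = 50 features
u = np.ones(50)                                        # feature weights
lam_group, lam_l1 = 0.3, 0.1                           # illustrative thresholds
Z1 = group_soft_threshold_columns(R, lam_group * u)    # group lasso part: prox of a weighted norm per column
Z2 = soft_threshold(R, lam_l1)                         # lasso part: elementwise soft-thresholding
```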
The following theorem provides the convergence analysis for Algorithm 1. The proof is inspired by Theorem B.1 in [33], but no additional conditions are required in our proof.

(1) Initialization: Let and . Given a start point , for
(2) repeat
(3)  Step 1. Update according to (25)
(4)  Step 2. Update according to (28), (29), and (30)
(5)  Step 3. Update the Lagrangian multipliers by (21)
(6) until the stopping criterion is satisfied.

Theorem 5. If the sequence is generated by Algorithm 1, then the sequence converges to an optimal solution of (16), and any accumulation point of satisfies the KKT conditions (19).

Proof. We first justify that the conditions required in Theorem B.1 in [33] automatically hold for (16).
(a) The existence of a solution for (16): note that the objective function tends to infinity as . Therefore, the level set

is bounded by Theorem 4.11 in [35]. Hence, we further know that the minimizer of (16) can be attained by Theorem 4.10 in [35].
(b) The constraints of (16) can be rewritten as , where

Hence, we find that , so is positive definite. Moreover, is positive definite because is the identity operator.
Based on these ingredients, the remainder of the proofs can be obtained easily (see Appendix B of [33] for details).

5. Numerical Results

The aim of this section is to illustrate the practical performance of sparse group lasso convex clustering (SGLCC), which performs well in theory. We conduct experiments on both synthetic and real datasets and compare the proposed method with K-means clustering, CC [10], and SCC [22]. The clustering results of K-means clustering are obtained with the built-in MATLAB function kmeans. All numerical experiments were performed in MATLAB R2017b on a personal computer with a 3.30 GHz processor and 4 GB of RAM.

In our experiments, we stop sPADMM based on the relative KKT residual of (16), which is given by

where

and is a given tolerance. If the requirement is not satisfied after a maximum number of iterations (maxiter), we also terminate the algorithm. In our experiments, we set and . Moreover, we use the same method to select the weights and the adaptive weights in all experiments. For the weights , we select the most popular and practical method proposed by [10]:

where and . We set and in all experiments. Inspired by [36], we let the adaptive weights , where is the estimator of in (1). As can be seen from models (1) and (2), the tuning parameters and control the number of clusters and the number of informative features, respectively. In our experiments, the principle for selecting and is to make the results match the true number of clusters and the true number of informative features as closely as possible, and to make the Rand index as large as possible. This principle is also used in the literature [21] to determine the tuning parameters.
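As an illustration of the weight construction from [10] mentioned above, a minimal sketch is given below; the number of neighbours k and the kernel parameter phi are placeholders, not the settings used in the experiments.

```python
import numpy as np

def knn_gaussian_weights(X, k=5, phi=0.5):
    """Sparse Gaussian-kernel weights in the style of Chi and Lange [10]:
    w_ij = exp(-phi * ||x_i - x_j||^2) if j is among the k nearest neighbours
    of i (or vice versa), and w_ij = 0 otherwise.  The values of k and phi
    here are placeholders, not the settings used in the paper."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    np.fill_diagonal(D2, np.inf)              # exclude self-pairs from the k-NN search
    nn = np.argsort(D2, axis=1)[:, :k]        # indices of the k nearest neighbours
    W = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            w = np.exp(-phi * D2[i, j])
            W[i, j] = W[j, i] = w             # symmetrise
    return W

# Example: weights for 10 random observations in R^5
W = knn_gaussian_weights(np.random.default_rng(0).standard_normal((10, 5)), k=3, phi=0.5)
```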

5.1. Synthetic Data

In this section, we evaluate the performance of SGLCC on synthetic datasets by comparing the results with those obtained with K-means clustering, CC, and SCC. We consider three synthetic datasets. Each synthetic dataset consists of observations with either or 4 clusters. The number of features is and ranges from 2000 to 5000, where the percentage of informative features is . The cluster label is uniformly sampled from . Without loss of generality, we assume that the first features are informative and the remaining features are noninformative. The first informative features are generated from an -dimensional normal distribution with mean and covariance , whose entry is , and the noninformative features are generated from a -dimensional standard normal distribution. Here,

where is a function whose value is one if the event happens and zero otherwise. Let us illustrate this definition with an example. Specifically, if , the first informative features of the observations with cluster labels 1 and 2 are generated from -dimensional normal distributions with mean and , respectively. In this part, we consider the following three synthetic datasets. Case I: ; Case II: ; Case III: .
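A minimal generator in the spirit of this construction is sketched below; the particular mean shifts, the correlation value rho, and the signal strength mu are illustrative assumptions rather than the exact settings of Cases I–III.

```python
import numpy as np

def make_synthetic(n=60, p=2000, s=50, K=2, rho=0.5, mu=1.5, seed=0):
    """Illustrative generator in the spirit of Section 5.1.  The first s
    features are informative: within each cluster they follow a normal
    distribution whose mean depends on the cluster label and whose covariance
    has (i, j) entry rho**|i - j|.  The remaining p - s features are standard
    normal noise.  The mean shift mu * k, the correlation rho, and the signal
    size mu are illustrative assumptions, not the paper's exact settings."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, K + 1, size=n)              # uniform cluster labels in {1, ..., K}
    idx = np.arange(s)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, p))                      # noise features everywhere
    X[:, :s] = X[:, :s] @ L.T                            # correlate the informative block
    for k in range(1, K + 1):
        X[labels == k, :s] += mu * k                     # cluster-dependent mean shift
    return X, labels

X, labels = make_synthetic(n=60, p=2000, s=50, K=2)
```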

We repeat each experiment 50 times and evaluate the performance through four common criteria for cluster analysis. The Rand index [37] and the Fowlkes–Mallows index [38] are two measures of the similarity between the estimated clustering result and the underlying true clustering assignment. Their values range from 0 to 1, and a larger value indicates better performance. For simplicity, we abbreviate the Rand index and the Fowlkes–Mallows index as RI and FMI, respectively. Moreover, the false negative rate (FNR) and the false positive rate (FPR) are reported to evaluate the performance of feature screening. All experimental results for the synthetic datasets are summarized in Tables 1–3.
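For completeness, the Rand index used here can be computed directly from two label vectors as the fraction of observation pairs on which the estimated and true partitions agree; a minimal sketch follows.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: the fraction of observation pairs on which the two
    partitions agree (both place the pair in the same cluster, or both
    place it in different clusters)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total

# A perfect (relabelled) assignment gives RI = 1.0
print(rand_index([1, 1, 2, 2], [5, 5, 7, 7]))   # -> 1.0
```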

From Table 1, several interesting observations can be made. (a) When the feature dimensions are high (), CC does not perform well, and its clustering results worsen as increases. (b) Because CC and K-means do not have the ability to eliminate noninformative features, their FNR and FPR are zero and one, respectively. (c) The clustering results of SGLCC and SCC are significantly better than those of CC when the feature dimensions are high because the penalty term is incorporated in convex clustering (1). (d) SGLCC has more stable performance than K-means and CC because it has the smallest standard deviation (SD). (e) SGLCC selects informative features with greater accuracy than SCC, that is, with a lower FPR and almost the same FNR. The phenomenon mentioned above can also be observed in the results for Case II; see Table 2 for details. It is worth noting that SCC and SGLCC have similar performance for Case II. This reflects the ability of SCC to conduct feature selection, especially for data with group sparsity and dimensions that are not very large. However, compared with SGLCC, the performance of SCC becomes unstable as the number of dimensions increases. The reason behind this difference is that SGLCC allows a more nuanced selection of informative features.

The experimental results for Case III are reported in Figure 1 and Table 3. As we can see from Figure 1, SGLCC is significantly better than SCC when there are highly correlated variables in the dataset, and the improvement becomes increasingly evident as increases. The main reason behind this may be the form of the sparse group lasso penalty, which, much like the elastic net, is good at handling the collinearity problem.

5.2. Real Data

In this section, five public datasets are used to validate the performance of our proposed SGLCC method. These datasets have been broadly applied in numerous studies in the literature, which is the primary reason why we selected them. All real datasets are summarized in Table 4. Brain A contains 42 observations with 5597 genes in five clusters of 10, 10, 10, 4, and 8 observations. SRBCT contains 63 observations in four clusters, and each observation has 2308 genes. The four clusters contain 23, 8, 12, and 20 observations. Prostate contains 102 observations with 5597 genes in two clusters of 50 and 52 observations. Colon contains 62 observations in two clusters of 22 and 40 observations, and each observation has 2000 genes. ARB contains 590 observations in eight clusters, and each observation has 8266 variables. The eight clusters contain 46, 81, 162, 189, 31, 12, 50, and 19 observations. We used the preprocessed versions of the first four datasets from [39]. In our experiments, each feature in all of the datasets is centered. The Rand index values of K-means, hierarchical clustering (Hclust), spectral clustering (Sclust), CC, SCC, and SGLCC for all five real datasets are reported in Table 5, and the best results are highlighted in bold. Here, the clustering results of Hclust and Sclust are obtained through two built-in MATLAB functions, namely, clusterdata and spectralcluster.

As we can see from Table 5, the performance of CC is unsatisfactory and worse than that of K-means on Brain A, Colon, and ARB. A similar phenomenon can also be observed in [15, 22]. SGLCC is significantly superior to K-means, Hclust, Sclust, and CC in terms of clustering performance, which implies that feature screening is an indispensable part of high-dimensional data analysis. Moreover, SGLCC is better than SCC because a more reasonable penalty term is used. In summary, our proposed SGLCC method has the largest Rand index among the six methods.

6. Conclusion

In this paper, we propose the Sparse Group Lasso Convex Clustering method for high-dimensional datasets, which can determine the number of clusters by learning from the data. SGLCC not only partitions the observations but also eliminates noninformative features. For the proposed clustering method, we provide theoretical finite sample bounds for the prediction error and establish feature screening consistency. Moreover, an efficient sPADMM is designed to implement our proposed estimator, and each subproblem has a closed-form solution. Convergence analysis of the sPADMM is established without any additional conditions. The experimental results illustrate that SGLCC provides clustering performance and feature selection abilities superior to those of the other compared clustering methods. In particular, our method is also very effective in real data applications.

Data Availability

The real datasets used in Section 5.2 are available from http://www.stat.cmu.edu/∼jiashun/Research/software/GenomicsData/ and http://archive.ics.uci.edu/ml/datasets.php.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

All authors read and approved the final manuscript. H. Chen mainly contributed to the statistical properties, algorithm design, and numerical results; L. Kong mainly contributed to the idea of the model and algorithm design; and Y. Li mainly contributed to the numerical results.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (11431002 and 11671029).

Supplementary Materials

All technical details of the proofs of Theorems 3.1–3.4 are included in the Supplementary Materials, which can be found at https://figshare.com/articles/online_resource/Supplementary_Materials_pdf/12579164. (Supplementary Materials)