1 Introduction

Clustering is one of the oldest and most difficult problems in machine learning, and a great number of clustering techniques have been proposed in the literature. Perhaps the most prominent clustering approach is the k-means algorithm (Lloyd, 1982). Its simplicity renders the algorithm particularly attractive to practitioners beyond the fields of data mining and machine learning (Punj & Stewart, 1983; Gasch & Eisen, 2002; Hennig & Liao, 2013; Frandsen et al., 2015). A variety of related algorithms and extensions have been studied (Dunn, 1973; Krishna & Murty, 1999; Dhillon et al., 2004). The model behind the standard k-means algorithm is an isotropic Gaussian mixture model (GMM) (Kulis & Jordan, 2012; Lücke & Forster, 2019). As such, it can only produce linear partitions of the input space.

A dominating recent line of research aims to alleviate this issue by learning an embedding of the data with a deep autoencoder (DAE) and having variants of Gaussian mixture models operate on the embedded instances (Xie et al., 2016; Yang et al., 2017; Guo et al., 2017; Fard et al., 2020; Miklautz et al., 2020). This approach is referred to as deep embedded clustering. An important reason for the emergence of this modern line of research in clustering is that it combines elements (the deep autoencoder and the GMM clustering) that are theoretically well understood when considered separately. In practice, since the embedding map is nonlinear, the otherwise linear clustering translates into a nonlinear solution in input space. This approach resembles the rationale behind kernel methods (Guyon et al., 1993). The difference lies in computing a kernel matrix before learning the clustering versus learning the feature map as the function approximated by the encoder. Another advantage of an autoencoder (AE) based approach is provided by the decoder of the network, which allows mapping the centroids back into (an approximated version of) the input space. Depending on the kernel function/matrix, this is not always possible with traditional approaches.

However, even with the advantages of deep embedded clustering described above, the fact that the embedding procedure and the clustering procedure are decoupled means that the former cannot benefit from the latter, and vice versa. In the most naive approach, the deep autoencoder is first trained, and then the features output by the autoencoder are clustered. In Xie et al. (2016), an improvement is made by alternately shrinking the clusters in the embedded space around centroids and then updating the latter using k-means. The shrinking may however lead to an adaptation of the clustering to the embedding instead of the desired clustering-induced embedding. In general, these approaches cannot handle a random initialization, which usually leads to noisy embeddings and, in turn, meaningless clusterings.

Despite the obvious need, few deep embedded clustering procedures have been proposed featuring an inherent capability of bringing out clustering structure in a joint and simultaneous procedure. This is clearly due to the lack of a theoretical foundation that could support such an objective. In this paper, we provide novel theoretical insight that enables us to simultaneously optimize both the embedding and the clustering by relying on a neural network interpretation of Gaussian mixture models. Our solution derives from a key result we prove: during EM, the objective function of a certain class of GMMs can be rephrased as the loss function of an autoencoder. A network trained with that loss is thus guaranteed to learn a linear clustering. Our contributions are threefold. (1) We state and prove a theorem connecting Gaussian mixture models and autoencoders. (2) We define the clustering module (CM) as a one-hidden-layer AE with \({\text {softmax}}\) activation trained with the loss resulting from the previous point. (3) We integrate the CM into a deep autoencoder to form a novel class of deep clustering architectures, denoted AE-CM. Building on the theoretical insight, the experiments confirm the empirical equivalence of the clustering module with GMM. As for the AE-CM, the model outperforms, on several datasets, baselines that extend Gaussian mixture models to nonlinear clustering with deep autoencoders.

The remainder of the paper is organized as follows. Section 2 reviews related works and baselines and Sect. 3 provides the theory underpinning our contributions. The clustering module and the AE-CM are introduced in Sect. 4. We report empirical evaluations in Sect. 5. Section 6 concludes the paper.

2 Related work

Recently, several deep learning models have been proposed to group data in an unsupervised manner (Tian et al., 2014; Springenberg, 2016; Yang et al., 2016; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Haeusser et al., 2018; McConville et al., 2019; Mukherjee et al., 2019; Uğur et al., 2020; Van Gansbeke et al., 2020). Since our main contribution unravels the connection between Gaussian mixture models and autoencoders, models based on associative (Haeusser et al., 2018; Yang et al., 2019), spectral (Tian et al., 2014; Yang et al., 2019; Bianchi et al., 2020) or subspace (Zhang et al., 2019; Miklautz et al., 2020) clustering are outside the scope of this paper. Furthermore, unlike (Yang et al., 2016; Chang et al., 2017; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Guo et al., 2017; Caron et al., 2018; Ji et al., 2019; Van Gansbeke et al., 2020), our model falls within the category of general-purpose deep embedded clustering models that build upon GMMs (Xie et al., 2016; Guo et al., 2017; Yang et al., 2017; Fard et al., 2020). We mention that recent works related to specific topics such as perturbation robustness and non-redundant subspace clustering also utilize ideas that are to some degree related to the general concept of deep embedded clustering (Yang et al., 2020; Miklautz et al., 2020). Given the high generality of our method, we detail models that are likewise built without advanced mechanisms, but can still support them.

One of the first approaches in this latter category of dominating modern deep clustering algorithms is DEC (Xie et al., 2016), which consists of an embedding network coupled with an ad-hoc matrix representing the centroids. During training, the network is trained to shrink the data around their closest centroids in a way similar to t-SNE (Maaten & Hinton, 2008). However, instead of matching the distributions in the input and feature spaces, the target distribution is a mixture of Student-t models inferred from a k-means clustering of the embedding itself. The latter is updated every few iterations to keep track of the evolution of the embedding, and the centroids are stored in the ad-hoc matrix. In order to avoid sub-optimal solutions when starting from random initialization, the embedding network is pre-trained. Nevertheless, the difference between the two training phases can cause stability issues, such as collapsing centroids. IDEC (Guo et al., 2017) extends DEC and alleviates this issue to some degree by also using the reconstruction loss of the DAE during the second training phase. Although IDEC is more robust than DEC, both are highly dependent on the initial embedding, as the clustering phase quickly shrinks the data around the centroids. The Deep Clustering Network (DCN) (Yang et al., 2017) further strengthens the deep embedded clustering framework and is based on an architecture comparable to DEC and IDEC, but includes hard clustering. The optimization scheme consists of three steps: (1) optimize the DAE to reconstruct and bring the data points closer to their closest centroids (similar to k-means); (2) update assignments to the closest clusters; (3) update the centroids iteratively using the assignments. Although alternating optimization helps prevent instabilities during the optimization, DCN still requires initialization schemes in order to reach good solutions. Our proposed model, in contrast, while being theoretically driven, removes the need for the alternating optimization schemes the previous methods rely on. Since we learn a variant of GMM using a neural network, the optimizations of both the embedding and the clustering profit from each other. In practice, our model is thus less reliant on pre-training schemes.

To some degree, our approach is similar to the very recent method known as DKM (Fard et al., 2020). The model shares the same architecture as DEC, namely a deep autoencoder and an ad-hoc matrix representing the centroids. However, here the model optimizes the network and the centroids simultaneously. To achieve that, the loss function includes a term similar to the loss of fuzzy c-means (Bezdek, 2013), i.e., k-means with soft assignments. The cluster coefficients are computed from the distances to the centroids and normalized using a parameterized \({\text {softmax}}\). In order to convert soft assignments into hard ones, the parameter follows an annealing strategy. In our model, the assignment probability function is indirectly related to the distance to the closest cluster representative, as used in DKM. Our loss function aims to minimize the distance between the data points and their reconstructions as convex combinations of the centroids. This means that the learning of the centroids and of the assignment probabilities regularize each other, mitigating the need for an annealing strategy.

There are also other recent and promising approaches to clustering that leverage deep neural networks. However, these operate in a very different manner compared to deep embedded clustering, which our novel theoretical insight sheds new light on and which enables our newly proposed autoencoder optimization for joint clustering and embedding. We nevertheless briefly review some of these approaches and also compare to representatives of such methods in our experiments. In particular, a great deal of work has focused on the generative aspect of Gaussian mixture models and studied variations based on deep generative models such as variational autoencoders (VAE) (Kingma & Welling, 2013) and generative adversarial networks (GAN) (Goodfellow et al., 2014).

Adversarial approaches for clustering have been proposed, with examples being CatGAN (Springenberg, 2016) and adversarial autoencoders (AAE) (Makhzani et al., 2015). The former optimizes the mutual information and predicts a categorical distribution of classes in the data while at the same time maximizing the robustness against an adversarial generative model. Instead, AAE uses two adversarial networks to impose a Gaussian and a categorical distribution in the latent space. The recent ClusterGAN (Mukherjee et al., 2019) makes an original use of an autoencoder, as the input is not a data point but a sample drawn from the product of a multinomial and a Gaussian distribution. The decoded sample is then fed to the discriminator and the encoder. This method is also robust to random network initialization.

Variational approaches that enable clustering include methods like Gaussian Mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and Variational Deep Embedding (VaDE) (Jiang et al., 2016). Although the former explicitly refers to GMMs, both methods are similar. The difference is that the parameters of the Gaussian distributions in embedded space depend only on the cluster index in VaDE, whereas GMVAE involves a second random variable. Both models require a pre-trained autoencoder. Variational information bottleneck with Gaussian mixture model (VIB-GMM) (Uğur et al., 2020) also assumes a mixture of Gaussian distributions on the embedding; however, the model is trained according to the deep variational information bottleneck (VIB) framework. The latter can be seen as an information-theoretic generalization of GAN and VAE (Tishby et al., 2000; Alemi et al., 2016). VIB-GMM proves to be a robust alternative to previous approaches, as it is able to produce meaningful clusterings from a randomly initialized autoencoder.

3 Theoretical groundwork

Throughout the paper the set of positive integers is denoted by \({\mathbb {N}}^*\). The set of d-dimensional stochastic vectors is written as \({\mathbb {S}}^d = \{ {\varvec{x}}\in {{\mathbb {R}}}_{\ge 0}^d \, : \, \sum _{i=1}^d x_i = 1\}.\) When the context allows, the ranges of the indices are abbreviated using the upper bound, e.g., a vector \({\varvec{x}}\in {\mathbb {R}}^d\) decomposes as \({\varvec{x}}=\langle x_j \rangle _{1\le j \le d}= \langle x_j \rangle _d\). The zero vector of \({\mathbb {R}}^d\) is written as \({\varvec{0}}_d\). The notation extends to any constant value, e.g., \({\varvec{1}}_d\) is the all-ones vector of \({\mathbb {R}}^d\). The identity matrix of \({\mathbb {R}}^{d \times d}\) is denoted by \({\mathbf {I}}_d\).

3.1 From GMMs to Autoencoders

We aim to fit an isotropic Gaussian mixture model with \(K \in {\mathbb {N}}^*\) components and a Dirichlet prior on the average cluster responsibilities to a dataset \({\mathcal {X}}=\{{\varvec{x}}_i\}_N \subset {\mathbb {R}}^d\). The expected value of the complete-data log-likelihood function of the model, also called the \({\mathcal {Q}}\)-function (Bishop, 2006), is:

$$\begin{aligned} \begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },\varvec{\Phi },{\varvec{\mu }}) = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} \big ( \log \phi _k - ||\varvec{x}_i - {\varvec{\mu }}_k||^2 \big ) + \sum _{k=1}^K(\alpha _k-1) \log \tilde{\varvec{\gamma }}_k. \end{aligned} \end{aligned}$$
(1)

In this expression, \(z_i \in {\mathbb {N}}\) is the cluster assignment of the data point \({\varvec{x}}_i\), \(\gamma _{ik}:= p(z_i=k | {\varvec{x}}_i)\) is the posterior probability of \(z_i=k\), also called the responsibility of cluster k for the data point \({\varvec{x}}_i\), \(\phi _{k}:= p(z_i=k)\) is the prior probability or mixture weight of cluster k, and \({\varvec{\mu }}_k \in {\mathbb {R}}^{ d}\) is the centroid of cluster k. The average responsibility of cluster k is \(\tilde{ \gamma }_k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik}\), and \(\tilde{\varvec{\gamma }} = \langle \tilde{ \gamma }_k \rangle _K \in {\mathbb {S}}^K\). The concentration hyperparameter of the Dirichlet prior is \(\langle \alpha _k \rangle _K \in {\mathbb {R}}^K {\setminus } \{ {\varvec{0}_K} \}\). The co-variance matrices do not appear in this expression as they are all constant and equal to \(\frac{1}{2} {\mathbf {I}}_d\) due to the isotropic assumption. We summarize the parameters into the matrices \(\varvec{\Gamma }\in {\mathbb {R}}^{N \times K}\), \(\varvec{\Phi }\in {\mathbb {S}}^{K}\) and \({\varvec{\mu }}\in {\mathbb {R}}^{K \times d}\). Note that the cluster responsibility vector of \({\varvec{x}}_i\) is stochastic, i.e., \({\varvec{\gamma }}_i = \langle \gamma _{ik} \rangle _K \in {\mathbb {S}}^K\).

Although the isotropic co-variances allow for great simplifications, they restrict the model to spherical Gaussian distributions. We later alleviate this constraint by introducing a deep autoencoder. The Dirichlet prior involves an extra parameter but also smooths the optimization. Note that these two assumptions make the model a relaxation of the one underlying k-means (Kulis & Jordan, 2012; Lücke & Forster, 2019).

Theorem 1

The maximization of the \({\mathcal {Q}}\)-function in Eq. (1) with respect to \(\varvec{\Phi }\) yields a term that can be interpreted as the reconstruction loss of an autoencoder.

Proof

During the EM-algorithm, the maximization with respect to \(\varvec{\Phi }\) updates the latter as the average cluster responsibility vector:

$$\begin{aligned} \phi _k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik} = \tilde{ \gamma }_k. \end{aligned}$$

Using this result, the \({\mathcal {Q}}\)-function can be rephrased as

$$\begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },{\varvec{\mu }}) = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} \log \tilde{\varvec{\gamma }}_k - \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} ||\varvec{x}_i - {\varvec{\mu }}_k||^2 + \sum _{k=1}^K(\alpha _k-1) \log \tilde{\varvec{\gamma }}_k. \end{aligned}$$
(2)

The first term is, up to the factor \(-N\), the entropy of \(\langle \tilde{ \gamma }_k\rangle _K\) which, given the Dirichlet prior, is constant and can thus be omitted. We now expand \(\gamma _{ik}\Vert \varvec{x}_i - {\varvec{\mu }}_k \Vert ^2\) by adding and subtracting the squared norm of \(\bar{{\varvec{x}}}_i = \sum _{k=1}^K \gamma _{ik} {\varvec{\mu }}_k\): for any given \(i \in [1 \mathrel {{.}\,{.}}N]\),

$$\begin{aligned} \sum _{k=1}^K \gamma _{ik}||{\varvec{x}}_i - {\varvec{\mu }}_k||^2&= \sum _{k=1}^K \gamma _{ik}||{\varvec{x}}_i||^2 - 2 {\varvec{x}}_i^\top \Big (\sum _{k=1}^K \gamma _{ik}{\varvec{\mu }}_k\Big ) + \sum _{k=1}^K \gamma _{ik}||{\varvec{\mu }}_k||^2 + ||\bar{{\varvec{x}}}_i||^2 - ||\bar{{\varvec{x}}}_i||^2\\ &= ||{\varvec{x}}_i||^2 - 2 {\varvec{x}}_i^\top \bar{{\varvec{x}}}_i + ||\bar{{\varvec{x}}}_i||^2 + \sum _{k=1}^K \gamma _{ik}||{\varvec{\mu }}_k||^2 - \sum _{k=1}^K \sum _{l=1}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l \\ &= ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2 + \sum _{k=1}^K \gamma _{ik}(1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2 - \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l. \end{aligned}$$

Note that this simplification rests on the isotropic assumption; it is not obvious how to reach a similar result without it. The \({\mathcal {Q}}\)-function becomes:

$$\begin{aligned} \begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },{\varvec{\mu }})&= - \underbrace{ \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2}_{=:E_1} - \underbrace{ \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2}_{=:E_2} \\&\quad + \underbrace{ \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l }_{=:E_3} - \underbrace{\sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k}_{=:E_4}. \end{aligned} \end{aligned}$$
(3)

The variable \(\bar{{\varvec{x}}}_i\) is actually a function of \({\varvec{x}}_i\) that factorizes into two functions \({\mathcal {F}}\) and \({\mathcal {G}}\). The former computes \(\varvec{\gamma }_{i}\) which, as a posterior probability, is a function of \({\varvec{x}}_i\). The latter is the matrix product with \({\varvec{\mu }}\).

$$\begin{aligned} \begin{aligned} {\mathcal {F}}:&\begin{array}{ccl} {\mathbb {R}}^d & \rightarrow & {\mathbb {S}}^K \\ {\varvec{x}}& \mapsto & {\mathcal {F}}( {\varvec{x}}; {\varvec{\eta }} )= \langle p(z=k| {\varvec{x}}) \rangle _K = \varvec{\gamma }\end{array} \\ {\mathcal {G}}:&\begin{array}{ccl} {\mathbb {S}}^K & \rightarrow & {\mathbb {R}}^d \\ \varvec{\gamma }& \mapsto & {\mathcal {G}}( \varvec{\gamma }; {\varvec{\mu }} )= \sum _{k=1}^K \gamma _{k} {\varvec{\mu }}_k = \bar{{\varvec{x}}}\end{array} \end{aligned} \end{aligned}$$
(4)

The first term of Eq. (3) can thus be interpreted as the reconstruction term characteristic of a loss of an autoencoder, consisting of the encoder and decoder functions, which are \({\mathcal {F}}\) and \({\mathcal {G}}\), respectively. \(\square\)
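As a quick sanity check of the expansion used in the proof, the per-point identity can be verified numerically. The following sketch is ours and not part of the original derivation; it draws a random responsibility vector and random centroids and compares both sides of the last equality:

```python
import numpy as np

# Check: sum_k g_k ||x - mu_k||^2 = ||x - x_bar||^2 + sum_k g_k (1 - g_k) ||mu_k||^2
#                                   - sum_{k != l} g_k g_l mu_k^T mu_l
rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)
mu = rng.normal(size=(K, d))                # centroids mu_k as rows
gamma = rng.dirichlet(np.ones(K))           # a stochastic responsibility vector
x_bar = gamma @ mu                          # convex combination of the centroids

lhs = np.sum(gamma * np.sum((x - mu) ** 2, axis=1))
gram = mu @ mu.T
cross = gamma @ gram @ gamma - np.sum(gamma ** 2 * np.diag(gram))   # terms with l != k
rhs = np.sum((x - x_bar) ** 2) + np.sum(gamma * (1 - gamma) * np.diag(gram)) - cross
assert np.isclose(lhs, rhs)
```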

3.2 Analysis of the terms

The four terms of the expansion of the \({\mathcal {Q}}\)-function in Eq. (3) give new insight into the training of GMMs.

\(E_1\): Reconstruction The term \(E_1\) suggests an autoencoder structure to optimize \({\mathcal {Q}}\). The decoder, \({\mathcal {G}}\), is a linear function. However, without loss of generality, it can be treated in practice as an affine function, i.e., a single-layer network with bias. Regarding the encoder \({\mathcal {F}}\), the architecture can be anything as long as it has a stochastic output.

\(E_2\): Sparsity and regularization The second term \(E_2\) is related to the Gini impurity index (Breiman et al., 1984) applied to \(\varvec{\gamma }_{i}\):

$$\begin{aligned} \begin{aligned} \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} )\Vert {\varvec{\mu }}_k\Vert ^2 \le \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} ) \Vert \varvec{\mu }\Vert ^2_F = \mathbf{Gini}(\varvec{\gamma }_{i}) \Vert \varvec{\mu }\Vert ^2_F, \end{aligned} \end{aligned}$$

where \(\Vert .\Vert _F\) is the Frobenius norm. The Gini index is standard in decision tree theory for selecting branching features and is an analogue of the entropy. It is nonnegative and null if, and only if, \(\varvec{\gamma }_i\) is a one-hot vector. Hence, minimizing this term favors sparse \(\varvec{\gamma }_i\), resulting in crisper assignments.
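For illustration, a short sketch of the Gini impurity of a responsibility vector (a hypothetical helper, not taken from the paper's code):

```python
import numpy as np

def gini(gamma_i):
    """Gini impurity of a responsibility vector; zero iff gamma_i is one-hot."""
    gamma_i = np.asarray(gamma_i)
    return float(np.sum(gamma_i * (1.0 - gamma_i)))

print(gini([1.0, 0.0, 0.0]))   # 0.0  -> unambiguous assignment
print(gini([0.4, 0.4, 0.2]))   # 0.64 -> ambiguous assignment
```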

The terms \(\Vert {\varvec{\mu }}_k\Vert ^2\) play a role similar to an \(\ell _2\)-regularization: they prevent the centroids from diverging away from the data points. However, they may also favor the trivial solution where all the centroids collapse to zero.

\(E_3\): Sparsity and cluster merging To study the behavior of \(E_3\) during the optimization, let us consider a simple example with one observation and two clusters, i.e., \(\varvec{\Gamma }\equiv \varvec{\gamma }_{1}=(\gamma , 1-\gamma )\). If the observation is unambiguously assigned to one cluster, \(\varvec{\gamma }_{1}\) is a one-hot vector and \(E_3\) is null. If that is not the case, the difference between \(E_2\) and \(E_3\) factorizes as follows:

$$\begin{aligned} E_2-E_3 = \gamma (1-\gamma ) \big ( \Vert {\varvec{\mu }}_1\Vert ^2 + \Vert {\varvec{\mu }}_2\Vert ^2 - 2 {\varvec{\mu }}_1^T {\varvec{\mu }}_2 \big ) = \gamma (1-\gamma ) \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2. \end{aligned}$$
(5)

The optimization will thus either push \(\varvec{\gamma }_{1}\) toward a sparser vector, or merge the two centroids. Appendix 3 presents an analysis of the role of this term.

\(E_4\): Balancing The Dirichlet prior steers the distribution of the cluster assignments. If none of the \(\alpha _k\) is null, the prior pushes the optimization to use all the clusters, thus moderating the tendency of \(E_2\) toward the trivial clustering (Yang et al., 2017; Guo et al., 2017).

Note that if \({\varvec{\alpha }} = \left( 1 + \frac{1}{K}\right) {\varvec{1}}_K\), \(E_4\) is, up to a constant, equal to the Kullback–Leibler (KL) divergence between a multinomial distribution with parameter \(\tilde{\varvec{\gamma }}\) and the uniform multinomial distribution:

$$\begin{aligned} \begin{aligned} \sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k = - \frac{1}{K} \sum _{k=1}^K \log \tilde{ \gamma }_k = D_{KL} \left( \frac{1}{K} {\varvec{1}}_K \Big \Vert \tilde{\varvec{\gamma }} \right) + \log K . \end{aligned} \end{aligned}$$
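This identity is easy to verify numerically; the short sketch below (ours, under the stated choice of \({\varvec{\alpha }}\)) checks it for a random \(\tilde{\varvec{\gamma }}\):

```python
import numpy as np

K = 4
alpha = (1.0 + 1.0 / K) * np.ones(K)            # concentration (1 + 1/K) 1_K
gamma_bar = np.random.dirichlet(np.ones(K))     # some average responsibilities in S^K

e4 = np.sum((1.0 - alpha) * np.log(gamma_bar))              # E_4
uniform = np.ones(K) / K
kl = np.sum(uniform * np.log(uniform / gamma_bar))          # D_KL(1/K 1_K || gamma_bar)

assert np.isclose(e4, kl + np.log(K))           # equal up to the constant log K
```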

4 Clustering and embedding with Autoencoders

Theorem 1 says that an autoencoder could be involved during the EM optimization of an isotropic GMM. We go one step further and bypass the EM algorithm to directly optimize the \({\mathcal {Q}}\)-function in Eq. (3) using an autoencoder.

4.1 The clustering module

We define the Clustering Module (CM) as the one-hidden-layer autoencoder with encoding and decoding functions \({\mathcal {F}}\) and \({\mathcal {G}}\) such that:

$$\begin{aligned} \begin{aligned} {\mathcal {F}}({\varvec{X}})&= {\text {softmax}}({\varvec{X}}{\varvec{W}}_{\text {enc} } + {\varvec{B}}_{\text {enc}}) = {\varvec{\Gamma }} \\ {\mathcal {G}}({\varvec{\Gamma }})&= {\varvec{\Gamma }} {\varvec{W}}_{\text {dec}} + {\varvec{B}}_{\text {dec}}=\bar{{\varvec{X}}}, \end{aligned} \end{aligned}$$
(6)

where \({\varvec{X}}\in {\mathbb {R}}^{N \times d}\) stacks the samples of \({\mathcal {X}}\), the code representation/cluster responsibilities are \(\varvec{\Gamma }= \langle \gamma _{ik} \rangle _{N \times K} \in {\mathbb {R}}^{N \times K}\) s.t. \(\varvec{\gamma }_{i} \in {\mathbb {S}}^K\), and the reconstruction is \(\bar{{\varvec{X}}}\in {\mathbb {R}}^{N \times d}\). The weight and bias parameters of the encoder are \({\varvec{W}}_{\text {enc}} \in {\mathbb {R}}^{d \times K}\) and \({\varvec{B}}_{\text {enc}} \in {\mathbb {R}}^{K}\), respectively, and analogously for the decoder \({\varvec{W}}_{\text {dec}} \in {\mathbb {R}}^{K \times d}\) and \({\varvec{B}}_{\text {dec}} \in {\mathbb {R}}^{d}\). The \({\text {softmax}}\) enforces the row-stochasticity of the code, i.e., of \(\varvec{\Gamma }\). The associated loss function is the negative of Eq. (3):

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text {CM}}({\mathcal {X}};\Theta )&= { \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2} + { \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2} \\&\quad - { \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l } + {\sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k}. \end{aligned} \end{aligned}$$
(7)

with \(\Theta =\Big ({\varvec{W}}_{\text {enc}},{\varvec{B}}_{\text {enc}},{\varvec{W}}_{\text {dec}},{\varvec{B}}_{\text {dec}}\Big )\). The centroids of the underlying GMM correspond to the images under \({\mathcal {G}}\) of the canonical basis vectors of \({\mathbb {R}}^K\).
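For concreteness, the following is a minimal TensorFlow 2 sketch of the clustering module and of the loss in Eq. (7). It is our own illustration, not the released implementation: the decoder bias is dropped for simplicity (the centroids are then exactly the rows of \({\varvec{W}}_{\text {dec}}\)), and the names ClusteringModule and cm_loss are ours.

```python
import tensorflow as tf

class ClusteringModule(tf.keras.Model):
    """One-hidden-layer autoencoder with a softmax code (Eq. 6), bias-free decoder."""
    def __init__(self, d, K, alpha=None):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(K, activation="softmax")  # F: R^d -> S^K
        self.decoder = tf.keras.layers.Dense(d, use_bias=False)        # G: S^K -> R^d
        self.alpha = alpha if alpha is not None else 1.0 + 1.0 / K     # Dirichlet concentration

    def call(self, x):
        gamma = self.encoder(x)        # responsibilities, rows in S^K
        x_bar = self.decoder(gamma)    # reconstruction as convex combination of centroids
        return gamma, x_bar

    def cm_loss(self, x, gamma, x_bar):
        # Assumes call() has been invoked at least once so the layers are built.
        mu = self.decoder.kernel                                  # (K, d): row k is mu_k
        rec = tf.reduce_sum(tf.square(x - x_bar))                 # E_1: reconstruction
        sq_norms = tf.reduce_sum(tf.square(mu), axis=1)           # ||mu_k||^2
        e2 = tf.reduce_sum(gamma * (1.0 - gamma) * sq_norms)      # E_2: Gini-like sparsity term
        gram = tf.matmul(mu, mu, transpose_b=True)                # mu_k^T mu_l
        off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))
        e3 = tf.reduce_sum(tf.matmul(gamma, off_diag) * gamma)    # E_3: cross terms (l != k)
        gamma_bar = tf.reduce_mean(gamma, axis=0)                 # average responsibilities
        e4 = tf.reduce_sum((1.0 - self.alpha) * tf.math.log(gamma_bar))  # Dirichlet prior term
        return rec + e2 - e3 + e4                                 # Eq. (7)
```

Training then amounts to minimizing cm_loss over mini-batches with any stochastic optimizer, e.g., inside a standard tf.GradientTape loop.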

Initialization

The CM can be initialized using k-means or any initialization scheme thereof, such as k-means++ (Arthur & Vassilvitskii, 2007). In such a case, the row-vectors of \({\varvec{W}}_{\text {dec}}\) are set equal to the desired centroids. The pseudo-inverse of this matrix becomes the encoder's weights, \({\varvec{W}}_{\text {enc}}\), and both bias vectors are set to zero.
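Continuing the sketch above, such an initialization could look as follows, assuming the module has already been built by one forward pass; the helper name is ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_cm_from_kmeans(cm, X, K, random_state=0):
    """Initialize a (built) ClusteringModule from k-means++ centroids."""
    centroids = KMeans(n_clusters=K, init="k-means++",
                       random_state=random_state).fit(X).cluster_centers_  # (K, d)
    w_dec = centroids.astype("float32")             # rows of W_dec are the centroids
    w_enc = np.linalg.pinv(w_dec).astype("float32") # (d, K): pseudo-inverse of W_dec
    cm.decoder.set_weights([w_dec])                 # bias-free decoder in this sketch
    cm.encoder.set_weights([w_enc, np.zeros(K, dtype="float32")])  # zero encoder bias
```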

Averaging epoch

Fig. 1: The intermediate centroids of the last epoch are spread, whereas their averages almost match the true centroids

In practice, the CM will be optimized by mini-batch learning with stochastic gradient descent. In such a procedure, the optimizer updates the positions of the centroids given the current batch. A small batch-size relative to the size of \({\mathcal {X}}\) may cause dispersion of the intermediate centroids. Hence, choosing the final centroids based on the last iteration may be sub-optimal.

We illustrate this phenomenon in Fig. 1. The data consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bi-variate Gaussians (\(K = 5\)) (gray dots). The data is standardized before processing. A CM is trained in mini-batches of size 20 over 50 epochs using stochastic gradient descent. The concentration is set to \(\varvec{\alpha }= {\varvec{5}}_K\). The dispersion of the centroids after each iteration of the last epoch (crosses) is significant. On the other hand, their average positions (squares) provide a good approximation of the true centers (circles). Therefore, we include one extra epoch in any implementation of the CM to compute the average position of the individual centroids over the last iterations.
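A minimal sketch of this averaging epoch, reusing the module sketched in Sect. 4.1 and assuming a hypothetical train_step(cm, batch) callback that performs one optimizer update:

```python
import numpy as np

def averaging_epoch(cm, batches, train_step):
    """Run one extra epoch and replace the centroids by their average over its iterations."""
    snapshots = []
    for batch in batches:                              # one pass over the mini-batches
        train_step(cm, batch)                          # hypothetical single update step
        snapshots.append(cm.decoder.kernel.numpy())    # intermediate centroids (K, d)
    averaged = np.mean(snapshots, axis=0).astype("float32")
    cm.decoder.set_weights([averaged])                 # final centroids = averaged positions
    return averaged
```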

4.2 Embedding with feature maps

The theory behind GMMs limits the CM to a linear decoder, thus enabling merely a linear partition of the input space. In addition, the isotropy assumption, specific to the CM, prevents clusters from spreading differently. We alleviate both limitations using an approach similar to that of kernel methods (Guyon et al., 1993): we non-linearly project the input into a feature space where it is clustered. However, instead of relying on kernel matrices, which provide only an implicit feature map, we learn the feature map explicitly using a deep autoencoder (DAE).

Fig. 2: Schematic representation of the AE-CM. Combining a clustering module and a deep autoencoder allows jointly learning a clustering and an embedding

The idea is to optimize the CM and the DAE simultaneously, in order to let the latter find distortions of the input space along the way that guide the CM toward a better optimum. Using a deep autoencoder architecture prevents the optimization from producing degenerate feature maps (Guo et al., 2017). It also preserves the generative nature of the model: points in the input space can be generated from a combination of centroids in the feature space. We refer to this model as the AE-CM.

The model consists of a clustering module nested into a deep autoencoder. The architecture is illustrated in Fig. 2. The first part of the DAE encodes an input \({\varvec{x}}\in {\mathbb {R}}^d\) into a vector \({\varvec{z}}\in {\mathbb {R}}^p\). Note that the CM now operates on the code representation \({\varvec{z}}\) and not directly on the input \({\varvec{x}}\). The code \({\varvec{z}}\) is fed to the CM and to the decoder of the DAE, yielding two outputs: \(\tilde{{\varvec{z}}}\), the CM's reconstruction of \({\varvec{z}}\), and \(\bar{{\varvec{x}}}\), the DAE's reconstruction of \({\varvec{x}}\).
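As an illustration only, the nesting can be sketched as follows, reusing the ClusteringModule from Sect. 4.1; the layer sizes follow the fully connected architecture reported in Sect. 5.1.4, and the helper name is ours:

```python
import tensorflow as tf

def make_ae_cm(d, p, K):
    """Sketch of the AE-CM of Fig. 2: a DAE encoder/decoder with a CM nested on the code."""
    dae_encoder = tf.keras.Sequential(
        [tf.keras.layers.Dense(n, activation=tf.nn.leaky_relu) for n in (500, 500, 2000)]
        + [tf.keras.layers.Dense(p)])                      # x -> z
    dae_decoder = tf.keras.Sequential(
        [tf.keras.layers.Dense(n, activation=tf.nn.leaky_relu) for n in (2000, 500, 500)]
        + [tf.keras.layers.Dense(d)])                      # z -> x_bar
    cm = ClusteringModule(p, K)                            # z -> (gamma, z_tilde)
    return dae_encoder, dae_decoder, cm

# Forward pass: z = dae_encoder(x); gamma, z_tilde = cm(z); x_bar = dae_decoder(z)
```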

4.2.1 Adapting the loss function to a deep architecture

Empirical evaluation showed that current gradient descent optimizers (e.g., Adam, Kingma & Ba, 2015) often return sub-optimal solutions when the reconstruction loss of the deep autoencoder is simply added to the loss of the CM. To help the optimization find better optima, we add the assumption that the centroids are orthonormal:

$$\begin{aligned} \forall k,l , \, {\varvec{\mu }}_k^T {\varvec{\mu }}_l = \delta _{kl} = \left\{ \begin{array}{cl} 1 & \text {if } k=l, \\ 0 & \text {otherwise}.\end{array} \right. \end{aligned}$$
(8)

Although the previous formula involves only \({\varvec{\mu }}\), which is learned by the nested clustering module, the constraint also affects the surrounding DAE. Indeed, it encourages the DAE to produce an embedding where the centroids can simultaneously be orthonormal and minimize the CM's loss. As a consequence, \(E_2\) simplifies and \(E_3\) vanishes:

$$\begin{aligned} \begin{aligned} E_2&= \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2 = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} ) \\ E_3&= \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l = 0 \end{aligned} \end{aligned}$$
(9)

Note that the constraint Eq. (8) is satisfied for the “ideal” clustering, in which case clusters will be mapped to the corners of a simplex in the embedding space, as discussed recently in Kampffmeyer et al. (2019). In this perspective, our inclusion of this constraint helps guide the clustering towards the ideal clustering, and at the same time simplifies the loss function.

We employ Lagrange multipliers to integrate the orthonormality constraint. That way the final loss can be stated as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{{\text {AE-CM}}}({\mathcal {X}};\Theta ) =&\beta \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2 \text {(Reconstruction DAE)} \\&+ \sum _{i=1}^N ||{\varvec{z}}_i - \tilde{{\varvec{z}}}_i||^2 \text {(Reconstruction CM)}\\&+ \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} ) \text {(Sparsity)} \\&+ \sum _{k=1}^K (1-\alpha _k)\log ( \tilde{\varvec{\gamma }}_k ) \text {(Dirichlet Prior)} \\&+ \lambda || {\varvec{\mu }}^T {\varvec{\mu }}- {\mathbf {I}}_K ||_1, \text {(Orthonormality)} \end{aligned} \end{aligned}$$
(10)

where \(\lambda > 0\) is the Lagrange multiplier and \(\beta > 0\) weights the DAE's reconstruction loss. We choose the \(\ell _1\)-norm to enforce orthonormality; however, other norms can be used.

Note that if the dimension of the embedding p is larger than the number of centroids K, the embedding can always be transformed to satisfy the orthonormality of the centroids. On the other hand, if \(K > p\), the assumption becomes restrictive also in terms of possible clusterings. Nevertheless, its importance can be reduced with a small \(\lambda\). This assumption also helps prevent the centroids from collapsing, as their norm is required to be 1. An analysis of the Lagrange multiplier is provided in Appendix 1.

The loss function of the AE-CM thus depends on four hyper-parameters: the weight \(\beta >0\), the concentration \({\varvec{\alpha }}\in {\mathbb {S}}^K\), the Lagrange multiplier \(\lambda >0\), and the size of the batches \(B \in {\mathbb {N}}^*\).
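A sketch of this composite loss, consistent with the modules sketched above (the tensor names and the function name are ours; \({\varvec{\alpha }}\), \(\beta\) and \(\lambda\) are passed as plain scalars or arrays):

```python
import tensorflow as tf

def ae_cm_loss(x, x_bar, z, z_tilde, gamma, mu, alpha, beta, lam):
    """Sketch of Eq. (10).

    x, x_bar: input and DAE reconstruction (N, d); z, z_tilde: DAE code and its CM
    reconstruction (N, p); gamma: responsibilities (N, K); mu: centroids (K, p);
    alpha: concentration (scalar or length-K vector); beta, lam: weights.
    """
    K = tf.shape(mu)[0]
    rec_dae = beta * tf.reduce_sum(tf.square(x - x_bar))           # Reconstruction DAE
    rec_cm = tf.reduce_sum(tf.square(z - z_tilde))                 # Reconstruction CM
    sparsity = tf.reduce_sum(gamma * (1.0 - gamma))                # Sparsity (||mu_k|| ~ 1)
    gamma_bar = tf.reduce_mean(gamma, axis=0)                      # average responsibilities
    prior = tf.reduce_sum((1.0 - alpha) * tf.math.log(gamma_bar))  # Dirichlet prior
    gram = tf.matmul(mu, mu, transpose_b=True)
    ortho = lam * tf.reduce_sum(tf.abs(gram - tf.eye(K)))          # l1 orthonormality penalty
    return rec_dae + rec_cm + sparsity + prior + ortho
```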

4.2.2 Implementation details

Since the AE-CM builds upon the CM, any implementation also includes an averaging epoch. In the case of pre-training, both sub-networks need to be initialized. We favor a straightforward end-to-end training of the DAE (without drop-out or noise) over a few epochs. The clustering module is then initialized using k-means++ on the embedded dataset. Finally, the CM is optimized alone using \({\mathcal {L}}_{\text {CM}}\) for a few epochs.

5 Experiments

In this section, we evaluate the clustering module and the AE-CM on several datasets covering different types of data. To highlight the generality of our method, we rely only on fully connected architectures, i.e., we do not use convolutional layers even for image datasets. Accordingly, we focus on general-purpose baselines. The experiments were conducted on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 32 GB of RAM, supported by an NVIDIA(R) Tesla V100 SXM2 32GB GPU.

5.1 Experimental Setting

5.1.1 Datasets

We leverage eight common benchmark data sets in the deep clustering literature plus one synthetic data set:

  • MNIST (LeCun et al., 1989) contains 70,000 handwritten images of the digits 0 to 9. The images are grayscale with the digits centered in the \(28\times 28\) images. The pixel values are normalized before processing.

  • fMNIST (Xiao et al., 2017) contains 70,000 images of fashion products organized in 10 classes. The images are grayscale with the product centered in the \(28\times 28\) images. The pixel values are normalized before processing.

  • USPS contains 9298 images of digits 0 to 9. The images are grayscale with size \(28\times 28\) pixels. The pixel values are normalized before processing.

  • CIFAR10 (Krizhevsky et al., 2009) contains 60,000 color images of 10 classes of subjects (dogs, cats, airplanes...). Images are of size \(32 \times 32\). The pixel values are normalized before processing.

  • Reuters10k, here abbreviated R10K, is derived from a corpus of 800,000 news articles. The dataset is pre-processed as in Guo et al. (2017) to return a subset of 10,000 random samples embedded into a 2,000-dimensional space (tf-idf transformation) and distributed over 4 (highly) imbalanced categories.

  • 20News contains 18,846 messages from newsgroups on 20 topics. Features consist of the tf-idf transformation of the 2000 most frequent words.

  • 10 \(\times\) 73k (Zheng et al., 2017) consists of 73,233 RNA transcripts belonging to 8 different cell types (Jang et al., 2017). The features consist of the \(\log\) of the gene expression of the 720 genes with the largest variance. The dataset is relatively sparse, with \(40\%\) of the entries being zero.

  • Pendigit (Alimoglu & Alpaydin, 1996) consists of 10,992 sequences of coordinates captured on a tablet as writers write digits, hence 10 classes. The dataset is normalized before processing.

  • 5 Gaussians consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bi-variate Gaussians (\(K = 5\)). The dataset is depicted in Fig. 1.

5.1.2 Evaluation metrics

The clustering performance of each model is evaluated using three frequently-used metrics: the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985), the Normalized Mutual Information (NMI) (Estévez et al., 2009), and the clustering accuracy (ACC) (Kuhn, 1955). These metrics range between 0 and 1, where 1 indicates a perfect clustering. For legibility, values are always multiplied by 100. For each table, scores not statistically different (t-test \(p<0.05\)) from the best score of the column are marked in boldface. The \(*\) indicates the model with the best run. A failure (−) corresponds to an ARI close to 0.
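ARI and NMI are available in scikit-learn, while ACC is commonly computed with the Hungarian algorithm over the cluster-to-label confusion matrix. A standard sketch (our helper, not taken from the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised accuracy: best one-to-one matching of clusters to labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    count = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                               # confusion matrix cluster x label
    row, col = linear_sum_assignment(-count)           # Hungarian algorithm (maximize)
    return count[row, col].sum() / y_true.size

# ari = adjusted_rand_score(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred)
```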

5.1.3 Baselines

We include three baselines for the CM: k-means (KM), a GMM with full co-variances, and an isotropic GMM (iGMM) with uniform mixture weights. The latter differs from the model the clustering module derives from, but the Dirichlet prior on the responsibilities yields intractable updates, and a Dirichlet prior on the mixture weights harms the performance.

We compare the AE-CM to four baselines reviewed in Sect. 2: DEC (Xie et al., 2016), its extension IDEC (Guo et al., 2017), DCN (Yang et al., 2017) and DKM (Fard et al., 2020). We also include the naive approach (AE+KM), consisting of a trained DAE followed by k-means on the embedding. We further add ClusterGAN (Mukherjee et al., 2019) and VIB-GMM (Uğur et al., 2020) as alternatives based on generative adversarial networks (Goodfellow et al., 2014) and variational autoencoders (Kingma & Welling, 2013), respectively.

Random and pre-trained initialization are indicated with \({}^r\) and \({}^p\), respectively. If omitted, the initialization is random. Every experiment is repeated 20 times.

5.1.4 Implementation

Both CM and AE-CM are implemented using TensorFlow 2.1 (Abadi et al., 2016). We also re-implemented DEC, IDEC and DCN. All deep models but ClusterGAN and VIB-GMM use the same fully connected autoencoder d-500-500-2000-p-2000-500-500-d with leaky ReLU activations, where d and p are the input and feature space dimensions, respectively. For ClusterGAN and VIB-GMM, we used the architecture provided in the original code. Likewise, the DAE reconstruction loss is the mean squared error, regardless of the dataset and of the model, except for VIB-GMM on images, which requires a cross-entropy loss (it underperforms otherwise). CM and its baselines are trained for up to 150 epochs, deep models for 1000 epochs. The hyper-parameters (batch-size, p, concentration, etc.) are listed in Tables 12 and 13.

5.2 CM: evaluation

Recall that the loss of the clustering module is a lower bound of the objective function of its underlying isotropic GMM, which approximates k-means. Moreover, the optimization of the CM is based on gradient descent instead of EM. We compare these three models as a sanity check and show that, despite the differences, they report similar clustering performance. We also present an ablation study of the loss and a model selection scheme for the CM. Analyses of the hyper-parameters and of the Dirichlet prior are reported in Appendices 1 and 2, respectively.

5.2.1 CM: clustering performance

In this experiment, we compare clustering performances and initialization schemes. Random and k-means++ initializations are indicated with the superscripts \({}^r\) and \({}^p\), respectively. Each experiment is repeated 20 times. We report averages in Table 1. For each dataset, scores not statistically significantly different from the highest score (\(p<.05\)) are marked in boldface. The \(*\) indicates the model with the highest best score among its 20 runs. An extended table including standard deviations and best runs can be found in Table 11 of Appendix 4. Average runtimes for MNIST are reported in Table 2.

Table 1 The clustering performance (\(\times 100\)) of different models on the selected datasets

As expected, the clustering module performs similarly to iGMM and k-means on every dataset with respect to almost every metric and for any initialization scheme. Often the k-means++ initialization does not improve the results. In the case of the clustering module, the difference is never significant except on the 20News dataset.

5.2.2 CM: runtimes

We compare here the runtimes of the different methods on MNIST. For a fair comparison, we do not use any early-stopping criterion and all the methods are run for exactly 150 epochs. We report in Table 2 averages over 10 runs.

Table 2 Average runtime of each model to cluster MNIST in 150 epochs

It appears clearly that EM-based models are much faster. Interestingly GMM is slower than iGMM, although they share the same implementation. This difference is certainly due to the extra computations needed to update the covariance matrices.

5.2.3 CM: ablation study of the loss

The loss function of the CM arises as a whole from the \({\mathcal {Q}}\) function of the underlying GMM. Nevertheless, for additional insight, we perform here an ablation study of its terms. We train a CM with different combinations of the terms of its original loss (Eq. (7)). To highlight the influence of each term, we focus on the 5 Gaussians dataset (depicted in Fig. 1). Table 3 reports the clustering performance in terms of ARI.

Table 3 Clustering performance (ARI) of the CM on the 5 Gaussians dataset trained with various combination of the terms of its original loss (first line)

The gap in terms of ARI between the CM trained with the complete loss (first line) and any other combination of its terms confirms the coherence of the loss. The model trained without the reconstruction term \(E_1\) reports the worst score. The ablation of single terms indicates that the reconstruction term \(E_1\) is the most important, followed by the sparsity term \(E_2\). Although the removal of only \(E_4\) has the least impact, it is the only term that, combined with \(E_1\), reports better performance than a loss function made solely of the reconstruction term \(E_1\).

Fig. 3: Evolution of the total loss, of each of its terms, and of the clustering metrics during the training of the CM on the 5 Gaussians dataset

Figure 3 illustrates the behavior of each term of the loss of the CM (\({\mathcal {L}}_{CM}\)) during training. Given the variations of each curve, it seems that during the first 25 epochs the optimization focuses on minimizing the reconstruction term, even if that implies an increase of the other ones. However, the complete loss curve does not flatten out until \(E_3\) reaches its minimum. From there, a second phase begins where the total loss slowly decreases while the clustering metrics slowly increase. Interestingly, as the metrics increase and the clustering improves, \(E_3\) also increases, which is contrary to the expected behavior. On the other hand, \(E_2\), which is of the same magnitude as \(E_3\) and is also influenced by the sparsity of the assignments, continuously decreases without reaching a minimum. Overall, these curves suggest that both \(E_2\) and \(E_3\) could be used to either stop the training early or select the best run.

5.2.4 CM: model selection

In an unsupervised scenario, true labels are not available. It is thus necessary to have an internal measure to avoid selecting a sub-optimal solution.

There are two natural choices: select either the clustering associated with the lowest loss or the least ambiguous clustering. In the first case, the sparsity of the cluster responsibilities \(\varvec{\gamma }_i\) might be eclipsed by other aspects optimized by \({\mathcal {L}}_{\text {CM}}\), such as the reconstruction term. On the other hand, by selecting based on sparsity alone, we may end up always choosing the most degenerate clustering. Leaning on the analysis of Fig. 3, we propose to use \({\mathcal {L}}_{\text {sp}}=E_2+E_3\), which sums the two terms of the loss governing the sparsity of the \(\varvec{\gamma }_i\)s while also involving the norms of the \({\varvec{\mu }}_k\)s.
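Given responsibilities and centroids, this criterion is straightforward to compute; a short sketch consistent with the CM code above (the function name is ours):

```python
import tensorflow as tf

def sparsity_criterion(gamma, mu):
    """L_sp = E_2 + E_3: internal score used to pick a run (lower is better)."""
    sq_norms = tf.reduce_sum(tf.square(mu), axis=1)             # ||mu_k||^2
    e2 = tf.reduce_sum(gamma * (1.0 - gamma) * sq_norms)
    gram = tf.matmul(mu, mu, transpose_b=True)
    off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))
    e3 = tf.reduce_sum(tf.matmul(gamma, off_diag) * gamma)
    return e2 + e3
```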

In Table 4, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and the largest ARI. Scores selected according to \({\mathcal {L}}_{\text {sp}}\) that were higher than the average are marked in boldface.

Table 4 Adjusted Rand index of the run with lowest \({\mathcal {L}}_{\text {sp}}\), the average run and the best run

A model selection based on \({\mathcal {L}}_{\text {sp}}\) finds the best runs only for R10K. Nevertheless, it selects runs with ARI greater than the average in more than half of the cases. In the other cases, the difference to the average score remains below 1 ARI point, except for CM\({}^p\) on 20News. The average absolute difference with the best score is 1.60 and 3.10 for CM\({}^r\) and CM\({}^p\), respectively. Without 20News, on which CM\({}^p\) performs the worst, that average difference drops to 2.25 for CM\({}^p\). These are satisfying results that substantiate our heuristic that \({\mathcal {L}}_{\text {sp}}=E_2+E_3\) can be used as an internal metric for the CM.

5.3 AE-CM: evaluation

In this section, we compare the clustering performance of our novel deep clustering model AE-CM against a set of baselines. We study the robustness of the model with respect to the number of clusters and a model selection scheme. We also evaluate the quality of the embeddings through the k-means clusterings thereof. Finally, we review the generative capabilities of our model.

5.3.1 AE-CM: clustering performance

We now compare clustering performances and initialization schemes for representative deep clustering models. We report average ARI, NMI and ACC over 20 runs in Table 5. An extended table including standard deviations and best runs can be found in Table 15 of Appendix 3.

Table 5 The clustering scores (\(\times 100\)) of representative deep clustering models on the selected datasets

In their original papers, DEC, IDEC and DCN are pre-trained (\({}^p\)). We report here slightly lower scores, which we ascribe to our implementation and slightly different architectures. Nonetheless, the take-home message here is the consistency of their poor results when randomly initialized. This reflects an inability to produce clusters from scratch, regardless of implementation. Note that, in that case, even k-means outperforms all of them on all the datasets (see Table 1). On the other hand, DKM does return competitive scores for both initialization schemes. Such results raise the question of how much clustering is actually performed by DEC, IDEC and DCN and how much is due to the pre-training phase.

In practice, DKM has proven sensitive to the choice of its \(\lambda\) hyper-parameter and to the duration of the optimization. For example, we could not find a value able to cluster Pendigit. We conjecture that expanding the clustering term of DKM’s loss, as we did between Eqs. (1) and (3), would improve the robustness of the model.

On six of the datasets, at least one of the variants of AE-CM reports the highest average or highest best run. Especially, AE-CM\({}^r\) produces competitive clusterings on all datasets except CIFAR10 despite its random initialization. This setback is expected as clustering models are known to fail to cluster color images from the raw pixels (Jiang et al., 2016; Hu et al., 2017).

Also, our AE-CM with random initialization always surpasses AE+KM except on R10K and CIFAR10, and even outperforms all the competitors by at least 20 ARI points on 20News. On the downside, AE-CM\({}^r\) is associated with large standard deviations, which implies a less predictable optimization (see Table 15 of Appendix 3). Therefore, we investigate an internal measure to select the best run.

5.3.2 AE-CM: runtimes

We compare here the runtimes of the different methods on MNIST. For a fair comparison, all the methods use the same batch size of 256 instances. We report in Table 6 averages over 10 runs. We do not use any early-stopping criterion.

Table 6 Average runtime of each model to cluster MNIST in 150 epochs

Note that our implementations of DEC, IDEC and DCN are based on that of the AE-CM. Hence, these are the most comparable. The advantage goes to the model with joint optimization, the AE-CM, which is more than 5 minutes faster. ClusterGAN is the slowest method. It also has the most complex architecture.

5.3.3 AE-CM: robustness to the number of clusters

In the previous experiments, we provided the true number of clusters to all algorithms for all datasets. In this experiment, we investigate the behavior of the AE-CM when it is set with a different number of clusters on four datasets: MNIST, USPS, R10K and Pendigit. Figure 4 shows the evolution of the ARI (left) and of the homogeneity score (Rosenberg & Hirschberg, 2007) (right). The latter measures the purity of each cluster in terms of true labels. The number of clusters varies from 5 to 20. The correct value for all datasets is 10.

Fig. 4: Adjusted Rand index and homogeneity score versus the number of clusters; the correct number of clusters is \(K=10\)

The ARI curves (left plot) reach their maximum at 10 and then decrease. This behavior is expected since this metric (as well as NMI and ACC) penalizes the number of clusters. On the other hand, the homogeneity curves (right plot) increase with K and stabilize for K larger than 10. The convergence of these curves indicates that the clustering performance of the AE-CM does not degrade if K is set larger than the ground truth. Such a result suggests that, when K is larger than the ground truth, the AE-CM finds solutions that are partitions of those found with smaller K. This phenomenon is illustrated in Appendix 3.

5.3.4 AE-CM: model selection

Similarly to the CM, we discuss here a model selection heuristic for the AE-CM. The rationale behind the use of a DAE is to have an encoding facilitating the objective of the clustering module. Hence, we propose to use the heuristic of the CM (Sect. 5.2.4). In Table 7, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and best ARI. Selected scores greater than the average are marked in boldface.

Table 7 Adjusted Rand index of the run with lowest \({\mathcal {L}}_{\text {sp}}\), the average run and the best run

Again, scores associated with the lowest \({\mathcal {L}}_{\text {sp}}\) are better than the average more than half of the time. The criterion detects the best runs of AE-CM\({}^r\) on MNIST and CIFAR10 and of AE-CM\({}^p\) on MNIST. The average absolute difference to the highest score is 6.5 and 6.6 ARI points for AE-CM\({}^r\) and AE-CM\({}^p\), respectively. In summary, although \({\mathcal {L}}_{\text {sp}}\) as a criterion does not necessarily select the best run, it filters out the worst runs.

5.3.5 AE-CM: embeddings for k-means

All baselines, including our model, are non-linear extensions of k-means and aim to improve the AE+KM. We audit the methods by running k-means on the embeddings produced by the 20 runs with random initialization on the MNIST dataset computed for Table 5 and report the average ARI, NMI and ACC in Table 8.

Table 8 Average clustering performance of k-means on different embeddings.

First, KM reports the worst results. This means that applying k-means on a feature space learned by an autoencoder does improve the quality of the clustering. Next, the results clearly show the superiority of methods utilizing a joint optimization, i.e., DKM\({}^r\)+KM and our AE-CM\({}^r\)+KM. Interestingly, the scores of DCN\({}^r\)+KM are here better than those of DCN\({}^r\). This discrepancy is certainly due to the moving average used to update the centroids.

Fig. 5: UMAP representation of a subset of MNIST and embeddings thereof learned by AE, IDEC and AE-CM. The squares indicate the centroids

We continue the analysis of the embeddings with UMAP (McInnes et al., 2018) projections. Figure 5 depicts the projections of different embeddings of the same 2000 data points. Figure 5a represents the UMAP projection from the input space. For (5b), we used the best run of AE+KM. For consistency, we used that embedding for the AE-CM\({}^p\). Hence, Fig. 5c does not show the best run of the AE-CM\({}^p\). Finally, (5d) is based on the best run of the AE-CM\({}^r\).

The projection of MNIST from the input space (5a) has two pairs of classes entangled: (3, 5) and (4, 9). The end-to-end training of the DAE (5b) successfully isolates each class except for cluster 4 (dark pink) and cluster 9 (light blue), which stay grouped together, although separable. The AE-CM\({}^p\) (5c) further contracts the clusters around the centroids found by AE+KM, but fails to separate 4 and 9. Remark that even the best run of the AE-CM\({}^p\) does not correctly split the data points. The centroids for 4 and 9 in Fig. 5b and c are in comparable positions: they align along the gap separating the true clusters. This suggests that the optimization of the AE-CM\({}^p\) did not move them much. This remark applies to the pre-trained baselines as well. Lastly, the AE-CM\({}^r\) successfully produces homogeneous groups (5d). Remark that the original entanglements of the pairs (3, 5) and (4, 9) are suggested by the trails between the corresponding clusters.

The previous observations summarize into two insights on the behavior of the AE-CM. If the AE-CM starts with an embedding associated with a low reconstruction loss for the DAE, the optimization contracts the clusters, which yields higher ARI scores. However, it is unable to move the centroids to reach another local optimum. Although the AE-CM\({}^r\) separates (4, 9), it also produces clusters more spread out than those of the AE-CM\({}^p\). The improved performance of the latter over AE+KM indicates that the AE-CM\({}^r\) would benefit from tighter groups.

5.3.6 AE-CM: sample generation and interpolation

Thanks to the reversible feature maps obtained by the DAE, both the AE-CM and its baselines (except DEC) are generative models. Figure 6 shows the decoding of the centroids of the best run of AE+KM (ARI=67.7), IDEC\({}^p\) (ARI=77.2), DKM\({}^r\) (ARI=83.6) and AE-CM\({}^r\) (ARI=88.6). AE+KM’s and IDEC\({}^p\)’s centroids for the 4 and 9 both look like 9’s. With an ARI and an ACC larger than 80 and 90, respectively, DKM\({}^r\) and AE-CM\({}^r\) both clustered the data correctly and found correct centroids for each class. Both models produce clear images for each class, which align reasonably well with the washed-out average image of the respective classes (first row).

Fig. 6: Centroids mapped back to image space for AE+KM, IDEC\({}^p\), DKM\({}^r\), and AE-CM\({}^r\). The first row displays the average image of each class

Fig. 7: Linear interpolations between different centroids (plots with border) produced with the AE-CM

Being a generative model, the AE-CM can also be used to interpolate between classes. Figure 7 shows a path made of nine interpolations between the ten centroids of the AE-CM\({}^r\). We observe smooth transitions between all the pairs, which indicate that the model learned a smooth manifold from noise (random initialization).

6 Conclusion

We presented a novel clustering algorithm that is jointly optimized with the embedding of an autoencoder to allow for nonlinear and interpretable clusterings. As a key result, we first showed that the objective function of an isotropic GMM can be turned into a loss function for autoencoders. The clustering module (CM), defined as the smallest resulting network, was shown to perform similarly to its underlying GMM in extensive empirical evaluations.

Importantly, we showed how the clustering module can be straightforwardly incorporated into deep autoencoders to allow for nonlinear clusterings. The resulting clustering network, the AE-CM, empirically outperformed existing centroid-based deep clustering architectures and performed on par with representative contemporary state-of-the-art deep clustering strategies. Nevertheless, the AE-CM, and to a lesser extent the clustering module itself, presented a greater volatility when trained from a randomly initialized network. We expect that this point could be improved by involving an annealing strategy, similarly to what is done in DKM and VIB-GMM.

A future line of work consists of extending the range of deep architectures into which the clustering module can be nested. In order to improve performance on image datasets, especially, it is necessary to involve convolutions. However, standard image-specific architectures are not structured as autoencoders. This raises the question of the robustness of our model with respect to the symmetry of the DAE, especially for applications where the computation of class representatives is not a must.

From a theoretical point of view, we believe that the derivations that led to the neural interpretation of Gaussian mixture models could benefit other mixture models, such as von Mises–Fisher mixture models (Hasnat et al., 2017) or hidden Markov models (HMM). The case of Gaussian HMMs seems especially promising as it allows a bridge to recurrent networks (Salaün et al., 2019).