1 Introduction

Clustering is one of the oldest and most difficult problems in machine learning, and a great number of clustering techniques have been proposed in the literature. Perhaps the most prominent clustering approach is the k-means algorithm (Lloyd, 1982). Its simplicity renders the algorithm particularly attractive to practitioners beyond the fields of data mining and machine learning (Punj & Stewart, 1983; Gasch & Eisen, 2002; Hennig & Liao, 2013; Frandsen et al., 2015). A variety of related algorithms and extensions have been studied (Dunn, 1973; Krishna & Murty, 1999; Dhillon et al., 2004). The model behind the standard k-means algorithm is an isotropic Gaussian mixture model (GMM) (Kulis & Jordan, 2012; Lücke & Forster, 2019). As such, it can only produce linear partitions of the input space.

A dominating recent line of research aims to alleviate this issue by learning an embedding of the data with a deep autoencoder (DAE) and having variants of Gaussian mixture models operate on the embedded instances (Xie et al., 2016; Yang et al., 2017; Guo et al., 2017; Fard et al., 2020; Miklautz et al., 2020). This approach is referred to as deep embedded clustering. An important reason for the emergence of this modern line of research in clustering is that it combines elements (the deep autoencoder and the GMM clustering) that are theoretically well understood when considered separately. In practice, since the embedding map is nonlinear, the otherwise linear clustering translates into a nonlinear solution in input space. This approach resembles the rationale behind kernel methods (Guyon et al., 1993). The difference lies in computing a kernel matrix before learning the clustering versus learning the feature map as the function approximated by the encoder. Another advantage of an autoencoder (AE) based approach is provided by the decoder of the network, which allows mapping the centroids back into (an approximated version of) the input space. Depending on the kernel function/matrix, this is not always possible with traditional approaches.

However, even with the advantages of deep embedded clustering described above, the fact that the embedding procedure and the clustering procedure are decoupled means that the former cannot benefit from the latter, and vice versa. In the most naive approach, the deep autoencoder is first trained, and then the features output by the autoencoder are clustered. In Xie et al. (2016), an improvement is made by alternately shrinking the clusters in the embedded space around centroids and then updating the latter using k-means. The shrinking may however lead to an adaptation of the clustering to the embedding instead of the desired clustering-induced embedding. In general, these approaches cannot handle a random initialization, which usually leads to noisy embeddings and, in turn, meaningless clusterings.

Despite the obvious need, few deep embedded clustering procedures have been proposed featuring an inherent capability of bringing out clustering structure in a joint and simultaneous procedure. This is clearly due to the lack of a theoretical foundation that could support such an objective. In this paper, we provide novel theoretical insight that enables us to simultaneously optimize both the embedding and the clustering by relying on a neural network interpretation of Gaussian mixture models. Our solution derives from a key result we prove: during EM, the objective function of a certain class of GMMs can be rephrased as the loss function of an autoencoder. A network trained with that loss is thus guaranteed to learn a linear clustering. Our contributions are threefold. (1) We state and prove a theorem connecting Gaussian mixture models and autoencoders. (2) We define the clustering module (CM) as a one-hidden-layer AE with \({\text {softmax}}\) activation trained with the loss resulting from the previous point. (3) We integrate the CM into a deep autoencoder to form a novel class of deep clustering architectures, denoted AE-CM. Building on the theoretical insight, the experiments confirm the empirical equivalence of the clustering module with GMM. As for the AE-CM, the model outperforms, on several datasets, baselines that extend Gaussian mixture models to nonlinear clustering with deep autoencoders.

The remainder of the paper is organized as follows. Section 2 reviews related works and baselines and Sect. 3 provides the theory underpinning our contributions. The clustering module and the AE-CM are introduced in Sect. 4. We report empirical evaluations in Sect. 5. Section 6 concludes the paper.

2 Related work

Recently, several deep learning models have been proposed to group data in an unsupervised manner (Tian et al., 2014; Springenberg, 2016; Yang et al., 2016; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Haeusser et al., 2018; McConville et al., 2019; Mukherjee et al., 2019; Uğur et al., 2020; Van Gansbeke et al., 2020). Since our main contribution unravels the connection between Gaussian mixture models and autoencoders, models based on associative (Haeusser et al., 2018; Yang et al., 2019), spectral (Tian et al., 2014; Yang et al., 2019; Bianchi et al., 2020) or subspace (Zhang et al., 2019; Miklautz et al., 2020) clustering are outside the scope of this paper. Furthermore, unlike (Yang et al., 2016; Chang et al., 2017; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Guo et al., 2017; Caron et al., 2018; Ji et al., 2019; Van Gansbeke et al., 2020), our model falls within the category of general-purpose deep embedded clustering models that build upon GMMs (Xie et al., 2016; Guo et al., 2017; Yang et al., 2017; Fard et al., 2020). We mention that recent works related to specific topics such as perturbation robustness and non-redundant subspace clustering also utilize ideas that are to some degree related to the general concept of deep embedded clustering (Yang et al., 2020; Miklautz et al., 2020). Given the high generality of our method, we detail models that are likewise built without advanced mechanisms, but can still support them.

One of the first approaches in this latter category of dominating modern deep clustering algorithms is DEC (Xie et al., 2016), which consists of an embedding network coupled with an ad-hoc matrix representing the centroids. During training, the network is trained to shrink the data around their closest centroids in a way similar to t-SNE (Maaten & Hinton, 2008). However, instead of matching the distributions in the input and feature spaces, the target distribution is a mixture of Student-t models inferred from a k-means clustering of the embedding itself. The latter is updated every few iterations to keep track of the evolution of the embedding, and the centroids are stored in the ad-hoc matrix. In order to avoid sub-optimal solutions when starting from random initialization, the embedding network is pre-trained. Nevertheless, the difference between the two training phases can cause stability issues, such as collapsing centroids. IDEC (Guo et al., 2017) extends DEC and alleviates this issue to some degree by also using the reconstruction loss of the DAE during the second training phase. Although IDEC is more robust than DEC, both are highly dependent on the initial embedding, as the clustering phase quickly shrinks the data around the centroids. The Deep Clustering Network (DCN) (Yang et al., 2017) further strengthens the deep embedded clustering framework and is based on an architecture comparable to DEC and IDEC, but includes hard clustering. The optimization scheme consists of three steps: (1) optimize the DAE to reconstruct and bring the data points closer to their closest centroids (similar to k-means); (2) update assignments to the closest clusters; (3) update the centroids iteratively using the assignments. Although alternating optimization helps prevent instabilities during the optimization, DCN still requires initialization schemes in order to reach good solutions. Our proposed model, in contrast, while being theoretically driven, removes the need for the alternating optimization schemes the previous methods rely on. Since we learn a variant of GMM using a neural network, the optimizations of both the embedding and the clustering profit from each other. In practice, our model is thus less reliant on pre-training schemes.

To some degree, our approach is similar to the very recent method known as DKM (Fard et al., 2020). The model shares the same architecture as DEC, namely a deep autoencoder and an ad-hoc matrix representing the centroids. However, here the model optimizes the network and the centroids simultaneously. To achieve that, the loss function includes a term similar to the loss of fuzzy c-means (Bezdek, 2013), i.e., k-means with soft assignments. The cluster coefficients are computed from the distances to the centroids and normalized using a parameterized \({\text {softmax}}\). In order to convert soft assignments into hard ones, the parameter follows an annealing strategy. In our model, the assignment probability function is indirectly related to the distance to the closest cluster representative, as used in DKM. Our loss function aims to minimize the distance between the data points and their reconstructions as convex combinations of the centroids. This means that the learning of the centroids and of the assignment probabilities regularize each other, mitigating the need for an annealing strategy.

There are also other recent and promising approaches to clustering that leverage deep neural networks. However, these operate in a very different manner compared to deep embedded clustering, which our novel theoretical insight sheds new light on and which enables our newly proposed autoencoder optimization for joint clustering and embedding. We nevertheless briefly review some of these approaches and also compare to representatives of such methods in our experiments. In particular, a great deal of work has focused on the generative aspect of Gaussian mixture models and studied variations based on deep generative models such as variational autoencoders (VAE) (Kingma & Welling, 2013) and generative adversarial networks (GAN) (Goodfellow et al., 2014).

Adversarial approaches for clustering have been proposed, with examples being CatGAN (Springenberg, 2016) and adversarial autoencoders (AAE) (Makhzani et al., 2015). The former optimizes the mutual information and predicts a categorical distribution of classes in the data while at the same time maximizing the robustness against an adversarial generative model. Instead, AAE uses two adversarial networks to impose a Gaussian and a categorical distribution in the latent space. The recent ClusterGAN (Mukherjee et al., 2019) makes an original use of an autoencoder, as the input is not a data point but a sample drawn from the product of a multinomial and a Gaussian distribution. The decoded sample is then fed to the discriminator and the encoder. This method is also robust to random network initialization.

Variational approaches that enable clustering include methods like Gaussian Mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and Variational Deep Embedding (VaDE) (Jiang et al., 2016). Although the former explicitly refers to GMMs, both methods are similar. The difference is that the parameters of the Gaussian distributions in embedded space depend only on the cluster index in VaDE, whereas GMVAE involves a second random variable. Both models require a pre-trained autoencoder. Variational information bottleneck with Gaussian mixture model (VIB-GMM) (Uğur et al., 2020) also assumes a mixture of Gaussian distributions on the embedding; however, the model is trained according to the deep variational information bottleneck (VIB) framework. The latter can be seen as an information-theoretic generalization of GAN and VAE (Tishby et al., 2000; Alemi et al., 2016). VIB-GMM proves to be a robust alternative to previous approaches, as it is able to produce meaningful clusterings from a randomly initialized autoencoder.

3 Theoretical groundwork

Throughout the paper the set of positive integers is denoted by \({\mathbb {N}}^*\). The set of d-dimensional stochastic vectors is written as \({\mathbb {S}}^d = \{ {\varvec{x}}\in {{\mathbb {R}}}_{\ge 0}^d \, : \, \sum _{i=1}^d x_i = 1\}.\) When the context allows, the ranges of the indices are abbreviated using the upper bound, e.g., a vector \({\varvec{x}}\in {\mathbb {R}}^d\) decomposes as \({\varvec{x}}=\langle x_j \rangle _{1\le j \le d}= \langle x_j \rangle _d\). The zero vector of \({\mathbb {R}}^d\) is written as \({\varvec{0}}_d\). The notation extends to any constant value, e.g., \({\varvec{1}}_d\) is the all-ones vector of \({\mathbb {R}}^d\). The identity matrix of \({\mathbb {R}}^{d \times d}\) is denoted by \({\mathbf {I}}_d\).

3.1 From GMMs to Autoencoders

We aim to fit an isotropic Gaussian mixture model with \(K \in {\mathbb {N}}^*\) components and a Dirichlet prior on the average cluster responsibilities to a dataset \({\mathcal {X}}=\{{\varvec{x}}_i\}_N \subset {\mathbb {R}}^d\). The expected value of the complete-data log-likelihood function of the model, also called the \({\mathcal {Q}}\)-function (Bishop, 2006), is:

$$\begin{aligned} \begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },\varvec{\Phi },{\varvec{\mu }}) = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} \big ( \log \phi _k - ||\varvec{x}_i - {\varvec{\mu }}_k||^2 \big ) + \sum _{k=1}^K(\alpha _k-1) \log \tilde{\varvec{\gamma }}_k. \end{aligned} \end{aligned}$$
(1)

In this expression, \(z_i \in {\mathbb {N}}\) is the cluster assignment of the data point \({\varvec{x}}_i\), \(\gamma _{ik}:= p(z_i=k | {\varvec{x}}_i)\) is the posterior probability of \(z_i=k\), also called the responsibility of cluster k for the data point \({\varvec{x}}_i\), \(\phi _{k}:= p(z_i=k)\) is the prior probability or mixture weight of cluster k, and \({\varvec{\mu }}_k \in {\mathbb {R}}^{ d}\) is the centroid of cluster k. The average responsibility of cluster k is \(\tilde{ \gamma }_k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik}\), and \(\tilde{\varvec{\gamma }} = \langle \tilde{ \gamma }_k \rangle _K \in {\mathbb {S}}^K\). The concentration hyperparameter of the Dirichlet prior is \(\langle \alpha _k \rangle _K \in {\mathbb {R}}^K {\setminus } \{ {\varvec{0}_K} \}\). The co-variance matrices do not appear in this expression as they are all constant and equal to \(\frac{1}{2} {\mathbf {I}}_d\) due to the isotropic assumption. We summarize the parameters into the matrices \(\varvec{\Gamma }\in {\mathbb {R}}^{N \times K}\), \(\varvec{\Phi }\in {\mathbb {S}}^{K}\) and \({\varvec{\mu }}\in {\mathbb {R}}^{K \times d}\). Note that the cluster responsibility vector of \({\varvec{x}}_i\) is stochastic, i.e., \({\varvec{\gamma }}_i = \langle \gamma _{ik} \rangle _K \in {\mathbb {S}}^K\).

Although the isotropic co-variances allow for great simplifications, they restrict the model to spherical Gaussian distributions. We later alleviate this constraint by introducing a deep autoencoder. The Dirichlet prior involves an extra parameter but also smooths the optimization. Note that these two assumptions make the model a relaxation of the one underlying k-means (Kulis & Jordan, 2012; Lücke & Forster, 2019).

Theorem 1

The maximization of the \({\mathcal {Q}}\)-function in Eq. (1) with respect to \(\varvec{\Phi }\) yields a term that can be interpreted as the reconstruction loss of an autoencoder.

Proof

During the EM-algorithm, the maximization with respect to \(\varvec{\Phi }\) updates the latter as the average cluster responsibility vector:

$$\begin{aligned} \phi _k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik} = \tilde{ \gamma }_k. \end{aligned}$$

Using this result, the \({\mathcal {Q}}\)-function can be rephrased as

$$\begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },{\varvec{\mu }}) = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} \log \tilde{\varvec{\gamma }}_k - \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} ||\varvec{x}_i - {\varvec{\mu }}_k||^2 + \sum _{k=1}^K(\alpha _k-1) \log \tilde{\varvec{\gamma }}_k. \end{aligned}$$
(2)

The first term is, up to the factor \(-N\), the entropy of \(\langle \tilde{ \gamma }_k\rangle _K\) which, given the Dirichlet prior, is constant and can thus be omitted. We now expand \(\gamma _{ik}\Vert \varvec{x}_i - {\varvec{\mu }}_k \Vert ^2\) by adding and subtracting the squared norm of \(\bar{{\varvec{x}}}_i = \sum _{k=1}^K \gamma _{ik} {\varvec{\mu }}_k\): for any given \(i \in [1 \mathrel {{.}\,{.}}N]\),

$$\begin{aligned} \sum _{k=1}^K \gamma _{ik}||{\varvec{x}}_i - {\varvec{\mu }}_k||^2&= \sum _{k=1}^K \gamma _{ik}||{\varvec{x}}_i||^2 - 2 {\varvec{x}}_i^\top \Big (\sum _{k=1}^K \gamma _{ik}{\varvec{\mu }}_k\Big ) + \sum _{k=1}^K \gamma _{ik}||{\varvec{\mu }}_k||^2 + ||\bar{{\varvec{x}}}_i||^2 - ||\bar{{\varvec{x}}}_i||^2\\ &= ||{\varvec{x}}_i||^2 - 2 {\varvec{x}}_i^\top \bar{{\varvec{x}}}_i + ||\bar{{\varvec{x}}}_i||^2 + \sum _{k=1}^K \gamma _{ik}||{\varvec{\mu }}_k||^2 - \sum _{k=1}^K \sum _{l=1}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l \\ &= ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2 + \sum _{k=1}^K \gamma _{ik}(1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2 - \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l. \end{aligned}$$

Note that this simplification rests on the isotropic assumption; it is not obvious how to reach a similar result without it. The \({\mathcal {Q}}\)-function becomes:

$$\begin{aligned} \begin{aligned} {\mathcal {Q}}(\varvec{\Gamma },{\varvec{\mu }})&= - \underbrace{ \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2}_{=:E_1} - \underbrace{ \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2}_{=:E_2} \\&\quad + \underbrace{ \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l }_{=:E_3} - \underbrace{\sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k}_{=:E_4}. \end{aligned} \end{aligned}$$
(3)

The variable \(\bar{{\varvec{x}}}_i\) is actually a function of \({\varvec{x}}_i\) that factorizes into two functions \({\mathcal {F}}\) and \({\mathcal {G}}\). The former computes \(\varvec{\gamma }_{i}\) which, as a posterior probability, is a function of \({\varvec{x}}_i\). The latter is the matrix product with \({\varvec{\mu }}\).

$$\begin{aligned} \begin{aligned} {\mathcal {F}}:&\begin{array}{ccl} {\mathbb {R}}^d & \rightarrow & {\mathbb {S}}^K \\ {\varvec{x}}& \mapsto & {\mathcal {F}}( {\varvec{x}}; {\varvec{\eta }} )= \langle p(z=k| {\varvec{x}}) \rangle _K = \varvec{\gamma }\end{array} \\ {\mathcal {G}}:&\begin{array}{ccl} {\mathbb {S}}^K & \rightarrow & {\mathbb {R}}^d \\ \varvec{\gamma }& \mapsto & {\mathcal {G}}( \varvec{\gamma }; {\varvec{\mu }} )= \sum _{k=1}^K \gamma _{k} {\varvec{\mu }}_k = \bar{{\varvec{x}}}\end{array} \end{aligned} \end{aligned}$$
(4)

The first term of Eq. (3) can thus be interpreted as the reconstruction term characteristic of a loss of an autoencoder, consisting of the encoder and decoder functions, which are \({\mathcal {F}}\) and \({\mathcal {G}}\), respectively. \(\square\)
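As a quick sanity check of the expansion used in the proof, the per-point identity can be verified numerically. The following sketch is ours and not part of the original derivation; it draws a random responsibility vector and random centroids and compares both sides of the last equality:

```python
import numpy as np

# Check: sum_k g_k ||x - mu_k||^2 = ||x - x_bar||^2 + sum_k g_k (1 - g_k) ||mu_k||^2
#                                   - sum_{k != l} g_k g_l mu_k^T mu_l
rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)
mu = rng.normal(size=(K, d))                # centroids mu_k as rows
gamma = rng.dirichlet(np.ones(K))           # a stochastic responsibility vector
x_bar = gamma @ mu                          # convex combination of the centroids

lhs = np.sum(gamma * np.sum((x - mu) ** 2, axis=1))
gram = mu @ mu.T
cross = gamma @ gram @ gamma - np.sum(gamma ** 2 * np.diag(gram))   # terms with l != k
rhs = np.sum((x - x_bar) ** 2) + np.sum(gamma * (1 - gamma) * np.diag(gram)) - cross
assert np.isclose(lhs, rhs)
```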

3.2 Analysis of the terms

The four terms of the expansion of the \({\mathcal {Q}}\)-function in Eq. (3) give new insight into the training of GMMs.

\(E_1\): Reconstruction The term \(E_1\) suggests an autoencoder structure to optimize \({\mathcal {Q}}\). The decoder, \({\mathcal {G}}\), is a linear function. However, without loss of generality, it can be treated in practice as an affine function, i.e., a single-layer network with bias. Regarding the encoder \({\mathcal {F}}\), the architecture can be anything as long as it has a stochastic output.

\(E_2\): Sparsity and regularization The second term \(E_2\) is related to the Gini impurity index (Breiman et al., 1984) applied to \(\varvec{\gamma }_{i}\):

$$\begin{aligned} \begin{aligned} \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} )\Vert {\varvec{\mu }}_k\Vert ^2 \le \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} ) \Vert \varvec{\mu }\Vert ^2_F = \mathbf{Gini}(\varvec{\gamma }_{i}) \Vert \varvec{\mu }\Vert ^2_F, \end{aligned} \end{aligned}$$

where \(\Vert .\Vert _F\) is the Frobenius norm. The Gini index is standard in decision tree theory for selecting branching features and is an analogue of the entropy. It is nonnegative and null if, and only if, \(\varvec{\gamma }_i\) is a one-hot vector. Hence, minimizing this term favors sparse \(\varvec{\gamma }_i\), resulting in crisper assignments.
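For illustration, a short sketch of the Gini impurity of a responsibility vector (a hypothetical helper, not taken from the paper's code):

```python
import numpy as np

def gini(gamma_i):
    """Gini impurity of a responsibility vector; zero iff gamma_i is one-hot."""
    gamma_i = np.asarray(gamma_i)
    return float(np.sum(gamma_i * (1.0 - gamma_i)))

print(gini([1.0, 0.0, 0.0]))   # 0.0  -> unambiguous assignment
print(gini([0.4, 0.4, 0.2]))   # 0.64 -> ambiguous assignment
```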

The terms \(\Vert {\varvec{\mu }}_k\Vert ^2\) play a role similar to an \(\ell _2\)-regularization: they prevent the centroids from diverging away from the data points. However, they may also favor the trivial solution where all the centroids collapse to zero.

\(E_3\): Sparsity and cluster merging To study the behavior of \(E_3\) during the optimization, let us consider a simple example with one observation and two clusters, i.e., \(\varvec{\Gamma }\equiv \varvec{\gamma }_{1}=(\gamma , 1-\gamma )\). If the observation is unambiguously assigned to one cluster, \(\varvec{\gamma }_{1}\) is a one-hot vector and \(E_3\) is null. If that is not the case, the difference between \(E_2\) and \(E_3\) factorizes as follows:

$$\begin{aligned} E_2-E_3 = \gamma (1-\gamma ) \big ( \Vert {\varvec{\mu }}_1\Vert ^2 + \Vert {\varvec{\mu }}_2\Vert ^2 - 2 {\varvec{\mu }}_1^T {\varvec{\mu }}_2 \big ) = \gamma (1-\gamma ) \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2. \end{aligned}$$
(5)

The optimization will thus either push \(\varvec{\gamma }_{1}\) toward a sparser vector, or merge the two centroids. Appendix 3 presents an analysis of the role of this term.

\(E_4\): Balancing The Dirichlet prior steers the distribution of the cluster assignments. If none of the \(\alpha _k\) is null, the prior pushes the optimization to use all the clusters, thus moderating the tendency of \(E_2\) toward the trivial clustering (Yang et al., 2017; Guo et al., 2017).

Note that if \({\varvec{\alpha }} = \left( 1 + \frac{1}{K}\right) {\varvec{1}}_K\), \(E_4\) is, up to a constant, equal to the Kullback–Leibler (KL) divergence between a multinomial distribution with parameter \(\tilde{\varvec{\gamma }}\) and the uniform multinomial distribution:

$$\begin{aligned} \begin{aligned} \sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k = - \frac{1}{K} \sum _{k=1}^K \log \tilde{ \gamma }_k = D_{KL} \left( \frac{1}{K} {\varvec{1}}_K \Big \Vert \tilde{\varvec{\gamma }} \right) + \log K . \end{aligned} \end{aligned}$$
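This identity is easy to verify numerically; the short sketch below (ours, under the stated choice of \({\varvec{\alpha }}\)) checks it for a random \(\tilde{\varvec{\gamma }}\):

```python
import numpy as np

K = 4
alpha = (1.0 + 1.0 / K) * np.ones(K)            # concentration (1 + 1/K) 1_K
gamma_bar = np.random.dirichlet(np.ones(K))     # some average responsibilities in S^K

e4 = np.sum((1.0 - alpha) * np.log(gamma_bar))              # E_4
uniform = np.ones(K) / K
kl = np.sum(uniform * np.log(uniform / gamma_bar))          # D_KL(1/K 1_K || gamma_bar)

assert np.isclose(e4, kl + np.log(K))           # equal up to the constant log K
```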

4 Clustering and embedding with Autoencoders

Theorem 1 says that an autoencoder could be involved during the EM optimization of an isotropic GMM. We go one step further and bypass the EM algorithm to directly optimize the \({\mathcal {Q}}\)-function in Eq. (3) using an autoencoder.

4.1 The clustering module

We define the Clustering Module (CM) as the one-hidden-layer autoencoder with encoding and decoding functions \({\mathcal {F}}\) and \({\mathcal {G}}\) such that:

$$\begin{aligned} \begin{aligned} {\mathcal {F}}({\varvec{X}})&= {\text {softmax}}({\varvec{X}}{\varvec{W}}_{\text {enc} } + {\varvec{B}}_{\text {enc}}) = {\varvec{\Gamma }} \\ {\mathcal {G}}({\varvec{\Gamma }})&= {\varvec{\Gamma }} {\varvec{W}}_{\text {dec}} + {\varvec{B}}_{\text {dec}}=\bar{{\varvec{X}}}, \end{aligned} \end{aligned}$$
(6)

where \({\varvec{X}}\in {\mathbb {R}}^{N \times d}\) stacks the samples of \({\mathcal {X}}\), the code representation/cluster responsibilities are \(\varvec{\Gamma }= \langle \gamma _{ik} \rangle _{N \times K} \in {\mathbb {R}}^{N \times K}\) s.t. \(\varvec{\gamma }_{i} \in {\mathbb {S}}^K\), and the reconstruction is \(\bar{{\varvec{X}}}\in {\mathbb {R}}^{N \times d}\). The weight and bias parameters of the encoder are \({\varvec{W}}_{\text {enc}} \in {\mathbb {R}}^{d \times K}\) and \({\varvec{B}}_{\text {enc}} \in {\mathbb {R}}^{K}\), respectively, and analogously for the decoder \({\varvec{W}}_{\text {dec}} \in {\mathbb {R}}^{K \times d}\) and \({\varvec{B}}_{\text {dec}} \in {\mathbb {R}}^{d}\). The \({\text {softmax}}\) enforces the row-stochasticity of the code, i.e., of \(\varvec{\Gamma }\). The associated loss function is the negative of Eq. (3):

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text {CM}}({\mathcal {X}};\Theta )&= { \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2} + { \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2} \\&\quad - { \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l } + {\sum _{k=1}^K (1-\alpha _k ) \log \tilde{ \gamma }_k}. \end{aligned} \end{aligned}$$
(7)

with \(\Theta =\Big ({\varvec{W}}_{\text {enc}},{\varvec{B}}_{\text {enc}},{\varvec{W}}_{\text {dec}},{\varvec{B}}_{\text {dec}}\Big )\). The centroids of the underlying GMM correspond to the images under \({\mathcal {G}}\) of the canonical basis vectors of \({\mathbb {R}}^K\).
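For concreteness, the following is a minimal TensorFlow 2 sketch of the clustering module and of the loss in Eq. (7). It is our own illustration, not the released implementation: the decoder bias is dropped for simplicity (the centroids are then exactly the rows of \({\varvec{W}}_{\text {dec}}\)), and the names ClusteringModule and cm_loss are ours.

```python
import tensorflow as tf

class ClusteringModule(tf.keras.Model):
    """One-hidden-layer autoencoder with a softmax code (Eq. 6), bias-free decoder."""
    def __init__(self, d, K, alpha=None):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(K, activation="softmax")  # F: R^d -> S^K
        self.decoder = tf.keras.layers.Dense(d, use_bias=False)        # G: S^K -> R^d
        self.alpha = alpha if alpha is not None else 1.0 + 1.0 / K     # Dirichlet concentration

    def call(self, x):
        gamma = self.encoder(x)        # responsibilities, rows in S^K
        x_bar = self.decoder(gamma)    # reconstruction as convex combination of centroids
        return gamma, x_bar

    def cm_loss(self, x, gamma, x_bar):
        # Assumes call() has been invoked at least once so the layers are built.
        mu = self.decoder.kernel                                  # (K, d): row k is mu_k
        rec = tf.reduce_sum(tf.square(x - x_bar))                 # E_1: reconstruction
        sq_norms = tf.reduce_sum(tf.square(mu), axis=1)           # ||mu_k||^2
        e2 = tf.reduce_sum(gamma * (1.0 - gamma) * sq_norms)      # E_2: Gini-like sparsity term
        gram = tf.matmul(mu, mu, transpose_b=True)                # mu_k^T mu_l
        off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))
        e3 = tf.reduce_sum(tf.matmul(gamma, off_diag) * gamma)    # E_3: cross terms (l != k)
        gamma_bar = tf.reduce_mean(gamma, axis=0)                 # average responsibilities
        e4 = tf.reduce_sum((1.0 - self.alpha) * tf.math.log(gamma_bar))  # Dirichlet prior term
        return rec + e2 - e3 + e4                                 # Eq. (7)
```

Training then amounts to minimizing cm_loss over mini-batches with any stochastic optimizer, e.g., inside a standard tf.GradientTape loop.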

Initialization

The CM can be initialized using k-means or any initialization scheme thereof, such as k-means++ (Arthur & Vassilvitskii, 2007). In such a case, the row-vectors of \({\varvec{W}}_{\text {dec}}\) are set equal to the desired centroids. The pseudo-inverse of this matrix becomes the encoder's weights, \({\varvec{W}}_{\text {enc}}\), and both bias vectors are set to zero.
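Continuing the sketch above, such an initialization could look as follows, assuming the module has already been built by one forward pass; the helper name is ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_cm_from_kmeans(cm, X, K, random_state=0):
    """Initialize a (built) ClusteringModule from k-means++ centroids."""
    centroids = KMeans(n_clusters=K, init="k-means++",
                       random_state=random_state).fit(X).cluster_centers_  # (K, d)
    w_dec = centroids.astype("float32")             # rows of W_dec are the centroids
    w_enc = np.linalg.pinv(w_dec).astype("float32") # (d, K): pseudo-inverse of W_dec
    cm.decoder.set_weights([w_dec])                 # bias-free decoder in this sketch
    cm.encoder.set_weights([w_enc, np.zeros(K, dtype="float32")])  # zero encoder bias
```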

Averaging epoch

Fig. 1: The intermediate centroids of the last epoch are spread, whereas their averages almost match the true centroids

In practice, the CM will be optimized by mini-batch learning with stochastic gradient descent. In such a procedure, the optimizer updates the positions of the centroids given the current batch. A small batch-size relative to the size of \({\mathcal {X}}\) may cause dispersion of the intermediate centroids. Hence, choosing the final centroids based on the last iteration may be sub-optimal.

We illustrate this phenomenon in Fig. 1. The data consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bi-variate Gaussians (\(K = 5\)) (gray dots). The data is standardized before processing. A CM is trained in mini-batches of size 20 over 50 epochs using stochastic gradient descent. The concentration is set to \(\varvec{\alpha }= {\varvec{5}}_K\). The dispersion of the centroids after each iteration of the last epoch (crosses) is significant. On the other hand, their average positions (squares) provide a good approximation of the true centers (circles). Therefore, we include one extra epoch in any implementation of the CM to compute the average position of the individual centroids over the last iterations.
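A minimal sketch of this averaging epoch, reusing the module sketched in Sect. 4.1 and assuming a hypothetical train_step(cm, batch) callback that performs one optimizer update:

```python
import numpy as np

def averaging_epoch(cm, batches, train_step):
    """Run one extra epoch and replace the centroids by their average over its iterations."""
    snapshots = []
    for batch in batches:                              # one pass over the mini-batches
        train_step(cm, batch)                          # hypothetical single update step
        snapshots.append(cm.decoder.kernel.numpy())    # intermediate centroids (K, d)
    averaged = np.mean(snapshots, axis=0).astype("float32")
    cm.decoder.set_weights([averaged])                 # final centroids = averaged positions
    return averaged
```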

4.2 Embedding with feature maps

The theory behind GMMs limits the CM to a linear decoder, thus enabling merely a linear partition of the input space. In addition, the isotropy assumption, specific to the CM, prevents clusters from spreading differently. We alleviate both limitations using an approach similar to that of kernel methods (Guyon et al., 1993): we non-linearly project the input into a feature space where it is clustered. However, instead of relying on kernel matrices, which provide only an implicit feature map, we learn the feature map explicitly using a deep autoencoder (DAE).

Fig. 2: Schematic representation of the AE-CM. Combining a clustering module and a deep autoencoder allows jointly learning a clustering and an embedding

The idea is to optimize the CM and the DAE simultaneously, in order to let the latter find distortions of the input space along the way that guide the CM toward a better optimum. Using a deep autoencoder architecture prevents the optimization from producing degenerate feature maps (Guo et al., 2017). It also preserves the generative nature of the model: points in the input space can be generated from a combination of centroids in the feature space. We refer to this model as the AE-CM.

The model consists of a clustering module nested into a deep autoencoder. The architecture is illustrated in Fig. 2. The first part of the DAE encodes an input \({\varvec{x}}\in {\mathbb {R}}^d\) into a vector \({\varvec{z}}\in {\mathbb {R}}^p\). Note that the CM now operates on the code representation \({\varvec{z}}\) and not directly on the input \({\varvec{x}}\). The code \({\varvec{z}}\) is fed to the CM and to the decoder of the DAE, yielding two outputs: \(\tilde{{\varvec{z}}}\), the CM's reconstruction of \({\varvec{z}}\), and \(\bar{{\varvec{x}}}\), the DAE's reconstruction of \({\varvec{x}}\).
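As an illustration only, the nesting can be sketched as follows, reusing the ClusteringModule from Sect. 4.1; the layer sizes follow the fully connected architecture reported in Sect. 5.1.4, and the helper name is ours:

```python
import tensorflow as tf

def make_ae_cm(d, p, K):
    """Sketch of the AE-CM of Fig. 2: a DAE encoder/decoder with a CM nested on the code."""
    dae_encoder = tf.keras.Sequential(
        [tf.keras.layers.Dense(n, activation=tf.nn.leaky_relu) for n in (500, 500, 2000)]
        + [tf.keras.layers.Dense(p)])                      # x -> z
    dae_decoder = tf.keras.Sequential(
        [tf.keras.layers.Dense(n, activation=tf.nn.leaky_relu) for n in (2000, 500, 500)]
        + [tf.keras.layers.Dense(d)])                      # z -> x_bar
    cm = ClusteringModule(p, K)                            # z -> (gamma, z_tilde)
    return dae_encoder, dae_decoder, cm

# Forward pass: z = dae_encoder(x); gamma, z_tilde = cm(z); x_bar = dae_decoder(z)
```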

4.2.1 Adapting the loss function to a deep architecture

Empirical evaluation showed that current gradient descent optimizers (e.g., Adam, Kingma & Ba, 2015) often return sub-optimal solutions when the reconstruction loss of the deep autoencoder is simply added to the loss of the CM. To help the optimization find better optima, we add the assumption that the centroids are orthonormal:

$$\begin{aligned} \forall k,l , \, {\varvec{\mu }}_k^T {\varvec{\mu }}_l = \delta _{kl} = \left\{ \begin{array}{cl} 1 & \text {if } k=l, \\ 0 & \text {otherwise}.\end{array} \right. \end{aligned}$$
(8)

Although the previous formula involves only \({\varvec{\mu }}\), which is learned by the nested clustering module, the constraint also affects the surrounding DAE. Indeed, it encourages the DAE to produce an embedding where the centroids can simultaneously be orthonormal and minimize the CM's loss. As a consequence, \(E_2\) simplifies and \(E_3\) vanishes:

$$\begin{aligned} \begin{aligned} E_2&= \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} )||{\varvec{\mu }}_k||^2 = \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik}( 1 - \gamma _{ik} ) \\ E_3&= \sum _{i=1}^N \sum _{k=1}^K \sum _{\begin{array}{c} l=1 \\ l\ne k \end{array}}^K \gamma _{ik}\gamma _{il} {\varvec{\mu }}_k^T {\varvec{\mu }}_l = 0 \end{aligned} \end{aligned}$$
(9)

Note that the constraint Eq. (8) is satisfied for the “ideal” clustering, in which case clusters will be mapped to the corners of a simplex in the embedding space, as discussed recently in Kampffmeyer et al. (2019). In this perspective, our inclusion of this constraint helps guide the clustering towards the ideal clustering, and at the same time simplifies the loss function.

We employ Lagrange multipliers to integrate the orthonormality constraint. That way the final loss can be stated as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{{\text {AE-CM}}}({\mathcal {X}};\Theta ) =&\beta \sum _{i=1}^N ||{\varvec{x}}_i - \bar{{\varvec{x}}}_i||^2 \text {(Reconstruction DAE)} \\&+ \sum _{i=1}^N ||{\varvec{z}}_i - \tilde{{\varvec{z}}}_i||^2 \text {(Reconstruction CM)}\\&+ \sum _{i=1}^N \sum _{k=1}^K \gamma _{ik} ( 1 - \gamma _{ik} ) \text {(Sparsity)} \\&+ \sum _{k=1}^K (1-\alpha _k)\log ( \tilde{\varvec{\gamma }}_k ) \text {(Dirichlet Prior)} \\&+ \lambda || {\varvec{\mu }}^T {\varvec{\mu }}- {\mathbf {I}}_K ||_1, \text {(Orthonormality)} \end{aligned} \end{aligned}$$
(10)

where \(\lambda > 0\) is the Lagrange multiplier and \(\beta > 0\) weights the DAE's reconstruction loss. We choose the \(\ell _1\)-norm to enforce orthonormality; however, other norms can be used.

Note that if the dimension of the embedding p is larger than the number of centroids K, the embedding can always be transformed to satisfy the orthonormality of the centroids. On the other hand, if \(K > p\), the assumption becomes restrictive also in terms of possible clusterings. Nevertheless, its importance can be reduced with a small \(\lambda\). This assumption also helps prevent the centroids from collapsing, as their norm is required to be 1. An analysis of the Lagrange multiplier is provided in Appendix 1.

The loss function of the AE-CM thus depends on four hyper-parameters: the weight \(\beta >0\), the concentration \({\varvec{\alpha }}\in {\mathbb {S}}^K\), the Lagrange multiplier \(\lambda >0\), and the size of the batches \(B \in {\mathbb {N}}^*\).
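A sketch of this composite loss, consistent with the modules sketched above (the tensor names and the function name are ours; \({\varvec{\alpha }}\), \(\beta\) and \(\lambda\) are passed as plain scalars or arrays):

```python
import tensorflow as tf

def ae_cm_loss(x, x_bar, z, z_tilde, gamma, mu, alpha, beta, lam):
    """Sketch of Eq. (10).

    x, x_bar: input and DAE reconstruction (N, d); z, z_tilde: DAE code and its CM
    reconstruction (N, p); gamma: responsibilities (N, K); mu: centroids (K, p);
    alpha: concentration (scalar or length-K vector); beta, lam: weights.
    """
    K = tf.shape(mu)[0]
    rec_dae = beta * tf.reduce_sum(tf.square(x - x_bar))           # Reconstruction DAE
    rec_cm = tf.reduce_sum(tf.square(z - z_tilde))                 # Reconstruction CM
    sparsity = tf.reduce_sum(gamma * (1.0 - gamma))                # Sparsity (||mu_k|| ~ 1)
    gamma_bar = tf.reduce_mean(gamma, axis=0)                      # average responsibilities
    prior = tf.reduce_sum((1.0 - alpha) * tf.math.log(gamma_bar))  # Dirichlet prior
    gram = tf.matmul(mu, mu, transpose_b=True)
    ortho = lam * tf.reduce_sum(tf.abs(gram - tf.eye(K)))          # l1 orthonormality penalty
    return rec_dae + rec_cm + sparsity + prior + ortho
```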

4.2.2 Implementation details

Since the AE-CM builds upon the CM, any implementation also includes an averaging epoch. In the case of pre-training, both sub-networks need to be initialized. We favor a straightforward end-to-end training of the DAE (without drop-out or noise) over a few epochs. The clustering module is then initialized using k-means++ on the embedded dataset. Finally, the CM is optimized alone using \({\mathcal {L}}_{\text {CM}}\) for a few epochs.

5 Experiments

In this section, we evaluate the clustering module and the AE-CM on several datasets covering different types of data. To highlight the generality of our method, we rely only on fully connected architectures, i.e., we do not use convolutional layers even for image datasets. Accordingly, we focus on general-purpose baselines. The experiments were conducted on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 32 GB of RAM, supported by an NVIDIA(R) Tesla V100 SXM2 32GB GPU.

5.1 Experimental Setting

5.1.1 Datasets

We leverage eight common benchmark data sets in the deep clustering literature plus one synthetic data set:

  • MNIST (LeCun et al., 1989) contains 70,000 handwritten images of the digits 0 to 9. The images are grayscale with the digits centered in the \(28\times 28\) images. The pixel values are normalized before processing.

  • fMNIST (Xiao et al., 2017) contains 70,000 images of fashion products organized in 10 classes. The images are grayscale with the product centered in the \(28\times 28\) images. The pixel values are normalized before processing.

  • USPS contains 9298 images of digits 0 to 9. The images are grayscale with size \(28\times 28\) pixels. The pixel values are normalized before processing.

  • CIFAR10 (Krizhevsky et al., 2009) contains 60,000 color images of 10 classes of subjects (dogs, cats, airplanes...). Images are of size \(32 \times 32\). The pixel values are normalized before processing.

  • Reuters10k, here abbreviated R10K, is derived from a corpus of 800,000 news articles. The dataset is pre-processed as in Guo et al. (2017) to return a subset of 10,000 random samples embedded into a 2,000-dimensional space (tf-idf transformation) and distributed over 4 (highly) imbalanced categories.

  • 20News contains 18,846 messages from newsgroups on 20 topics. Features consist of the tf-idf transformation of the 2000 most frequent words.

  • 10 \(\times\) 73k (Zheng et al., 2017) consists of 73,233 RNA transcripts belonging to 8 different cell types (Jang et al., 2017). The features consist of the \(\log\) of the gene expression of the 720 genes with the largest variance. The dataset is relatively sparse, with \(40\%\) of the entries being zero.

  • Pendigit (Alimoglu & Alpaydin, 1996) consists of 10,992 sequences of coordinates captured on a tablet as writers write digits, hence 10 classes. The dataset is normalized before processing.

  • 5 Gaussians consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bi-variate Gaussians (\(K = 5\)). The dataset is depicted in Fig. 1.

5.1.2 Evaluation metrics

The clustering performance of each model is evaluated using three frequently-used metrics: the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985), the Normalized Mutual Information (NMI) (Estévez et al., 2009), and the clustering accuracy (ACC) (Kuhn, 1955). These metrics range between 0 and 1, where 1 indicates a perfect clustering. For legibility, values are always multiplied by 100. For each table, scores not statistically different (t-test \(p<0.05\)) from the best score of the column are marked in boldface. The \(*\) indicates the model with the best run. A failure (−) corresponds to an ARI close to 0.
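ARI and NMI are available in scikit-learn, while ACC is commonly computed with the Hungarian algorithm over the cluster-to-label confusion matrix. A standard sketch (our helper, not taken from the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised accuracy: best one-to-one matching of clusters to labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    count = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                               # confusion matrix cluster x label
    row, col = linear_sum_assignment(-count)           # Hungarian algorithm (maximize)
    return count[row, col].sum() / y_true.size

# ari = adjusted_rand_score(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred)
```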

5.1.3 Baselines

We include three baselines for the CM: k-means (KM), a GMM with full co-variances, and an isotropic GMM (iGMM) with uniform mixture weights. The latter differs from the model the clustering module derives from, but the Dirichlet prior on the responsibilities yields intractable updates, and a Dirichlet prior on the mixture weights harms the performance.

We compare the AE-CM to four baselines reviewed in Sect. 2: DEC (Xie et al., 2016), its extension IDEC (Guo et al., 2017), DCN (Yang et al., 2017) and DKM (Fard et al., 2020). We also include the naive approach (AE+KM), consisting of a trained DAE followed by k-means on the embedding. We further add ClusterGAN (Mukherjee et al., 2019) and VIB-GMM (Uğur et al., 2020) as alternatives based on generative adversarial networks (Goodfellow et al., 2014) and variational autoencoders (Kingma & Welling, 2013), respectively.

Random and pre-trained initialization are indicated with \({}^r\) and \({}^p\), respectively. If omitted, the initialization is random. Every experiment is repeated 20 times.

5.1.4 Implementation

Both CM and AE-CM are implemented using TensorFlow 2.1 (Abadi et al., 2016). We also re-implemented DEC, IDEC and DCN. All deep models but ClusterGAN and VIB-GMM use the same fully connected autoencoder d-500-500-2000-p-2000-500-500-d with leaky ReLU activations, where d and p are the input and feature space dimensions, respectively. For ClusterGAN and VIB-GMM, we used the architecture provided in the original code. Likewise, the DAE reconstruction loss is the mean squared error, regardless of the dataset and of the model, except for VIB-GMM on images, which requires a cross-entropy loss (it underperforms otherwise). CM and its baselines are trained for up to 150 epochs, deep models for 1000 epochs. The hyper-parameters (batch-size, p, concentration, etc.) are listed in Tables 12 and 13.

5.2 CM: evaluation

Recall that the loss of the clustering module is a lower bound of the objective function of its underlying isotropic GMM, which approximates k-means. Moreover, the optimization of the CM is based on gradient descent instead of EM. We compare these three models as a sanity check and show that, despite the differences, they report similar clustering performance. We also present an ablation study of the loss and a model selection scheme for the CM. Analyses of the hyper-parameters and of the Dirichlet prior are reported in Appendices 1 and 2, respectively.

5.2.1 CM: clustering performance

In this experiment, we compare clustering performances and initialization schemes. Random and k-means++ initializations are indicated with the superscripts \({}^r\) and \({}^p\), respectively. Each experiment is repeated 20 times. We report averages in Table 1. For each dataset, scores not statistically significantly different from the highest score (\(p<.05\)) are marked in boldface. The \(*\) indicates the model with the highest best score among its 20 runs. An extended table including standard deviations and best runs can be found in Table 11 of Appendix 4. Average runtimes for MNIST are reported in Table 2.

Table 1 The clustering performance (\(\times 100\)) of different models on the selected datasets

As expected, the clustering module performs similarly to iGMM and k-means on every dataset with respect to almost every metric and for any initialization scheme. Often the k-means++ initialization does not improve the results. In the case of the clustering module, the difference is never significant except on the 20News dataset.

5.2.2 CM: runtimes

We compare here the runtimes of the different methods on MNIST. For a fair comparison, we do not use any early-stopping criterion and all the methods are run for exactly 150 epochs. We report in Table 2 averages over 10 runs.

Table 2 Average runtime of each model to cluster MNIST in 150 epochs

It appears clearly that EM-based models are much faster. Interestingly GMM is slower than iGMM, although they share the same implementation. This difference is certainly due to the extra computations needed to update the covariance matrices.

5.2.3 CM: ablation study of the loss

The loss function of the CM arises as a whole from the \({\mathcal {Q}}\) function of the underlying GMM. Nevertheless, for additional insight, we perform here an ablation study of its terms. We train a CM with different combinations of the terms of its original loss (Eq. (7)). To highlight the influence of each term, we focus on the 5 Gaussians dataset (depicted in Fig. 1). Table 3 reports the clustering performance in terms of ARI.

Table 3 Clustering performance (ARI) of the CM on the 5 Gaussians dataset trained with various combination of the terms of its original loss (first line)

The gap in terms of ARI between the CM trained with the complete loss (first line) and any other combination of its terms confirms the coherence of the loss. The model trained without the reconstruction term \(E_1\) reports the worst score. The ablation of single terms indicates that the reconstruction term \(E_1\) is the most important, followed by the sparsity term \(E_2\). Although the removal of only \(E_4\) has the least impact, it is the only term that, combined with \(E_1\), reports better performance than a loss function made solely of the reconstruction term \(E_1\).

Fig. 3: Evolution of the total loss, of each of its terms, and of the clustering metrics during the training of the CM on the 5 Gaussians dataset

Figure 3 illustrates the behavior of each term of the loss of the CM (\({\mathcal {L}}_{CM}\)) during training. Given the variations of each curve, it seems that during the first 25 epochs the optimization focuses on minimizing the reconstruction term, even if that implies an increase of the other ones. However, the complete loss curve does not flatten out until \(E_3\) reaches its minimum. From there, a second phase begins where the total loss slowly decreases while the clustering metrics slowly increase. Interestingly, as the metrics increase and the clustering improves, \(E_3\) also increases, which is contrary to the expected behavior. On the other hand, \(E_2\), which is of the same magnitude as \(E_3\) and is also influenced by the sparsity of the assignments, continuously decreases without reaching a minimum. Overall, these curves suggest that both \(E_2\) and \(E_3\) could be used to either stop the training early or select the best run.

5.2.4 CM: model selection

In an unsupervised scenario, true labels are not available. It is thus necessary to have an internal measure to avoid selecting a sub-optimal solution.

There are two natural choices: select either the clustering associated with the lowest loss or the least ambiguous clustering. In the first case, the sparsity of the cluster responsibilities \(\varvec{\gamma }_i\) might be eclipsed by other aspects optimized by \({\mathcal {L}}_{\text {CM}}\), such as the reconstruction term. On the other hand, by selecting based on sparsity alone, we may end up always choosing the most degenerate clustering. Leaning on the analysis of Fig. 3, we propose to use \({\mathcal {L}}_{\text {sp}}=E_2+E_3\), which sums the two terms of the loss governing the sparsity of the \(\varvec{\gamma }_i\)s while also involving the norms of the \({\varvec{\mu }}_k\)s.
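Given responsibilities and centroids, this criterion is straightforward to compute; a short sketch consistent with the CM code above (the function name is ours):

```python
import tensorflow as tf

def sparsity_criterion(gamma, mu):
    """L_sp = E_2 + E_3: internal score used to pick a run (lower is better)."""
    sq_norms = tf.reduce_sum(tf.square(mu), axis=1)             # ||mu_k||^2
    e2 = tf.reduce_sum(gamma * (1.0 - gamma) * sq_norms)
    gram = tf.matmul(mu, mu, transpose_b=True)
    off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))
    e3 = tf.reduce_sum(tf.matmul(gamma, off_diag) * gamma)
    return e2 + e3
```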

In Table 4, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and the largest ARI. Scores selected according to \({\mathcal {L}}_{\text {sp}}\) that were higher than the average are marked in boldface.

Table 4 Adjusted Rand index of the run with lowest \({\mathcal {L}}_{\text {sp}}\), the average run and the best run

A model selection based on \({\mathcal {L}}_{\text {sp}}\) finds the best runs only for R10K. Nevertheless, it selects runs with ARI greater than the average in more than half of the cases. In the other cases, the difference to the average score remains below 1 ARI point, except for CM\({}^p\) on 20News. The average absolute difference with the best score is 1.60 and 3.10 for CM\({}^r\) and CM\({}^p\), respectively. Without 20News, on which CM\({}^p\) performs the worst, that average difference drops to 2.25 for CM\({}^p\). These are satisfying results that substantiate our heuristic that \({\mathcal {L}}_{\text {sp}}=E_2+E_3\) can be used as an internal metric for the CM.

5.3 AE-CM: evaluation

In this section, we compare the clustering performance of our novel deep clustering model AE-CM against a set of baselines. We study the robustness of the model with respect to the number of clusters and a model selection scheme. We also evaluate the quality of the embeddings through the k-means clusterings thereof. Finally, we review the generative capabilities of our model.

5.3.1 AE-CM: clustering performance

We now compare clustering performances and initialization schemes for representative deep clustering models. We report average ARI, NMI and ACC over 20 runs in Table 5. An extended table including standard deviations and best runs can be found in Table 15 of Appendix 3.

Table 5 The clustering scores (\(\times 100\)) of representative deep clustering models on the selected datasets

In their original papers, DEC, IDEC and DCN are pre-trained (\({}^p\)). We report here slightly lower scores, which we ascribe to our implementation and slightly different architectures. Nonetheless, the take-home message here is the consistency of their poor results when randomly initialized. This reflects an inability to produce clusters from scratch, regardless of implementation. Note that, in that case, even k-means outperforms all of them on all the datasets (see Table 1). On the other hand, DKM does return competitive scores for both initialization schemes. Such results raise the question of how much clustering is actually performed by DEC, IDEC and DCN and how much is due to the pre-training phase.

In practice, DKM has proven sensitive to the choice of its \(\lambda\) hyper-parameter and to the duration of the optimization. For example, we could not find a value able to cluster Pendigit. We conjecture that expanding the clustering term of DKM’s loss, as we did between Eqs. (1) and (3), would improve the robustness of the model.

On six of the datasets, at least one of the variants of AE-CM reports the highest average or highest best run. Especially, AE-CM\({}^r\) produces competitive clusterings on all datasets except CIFAR10 despite its random initialization. This setback is expected as clustering models are known to fail to cluster color images from the raw pixels (Jiang et al., 2016; Hu et al., 2017).

Also, our AE-CM with random initialization always surpasses AE+KM except on R10K and CIFAR10, and even outperforms all the competitors by at least 20 ARI points on 20News. On the downside, AE-CM\({}^r\) is associated with large standard deviations, which implies a less predictable optimization (see Table 15 of Appendix 3). Therefore, we investigate an internal measure to select the best run.

5.3.2 AE-CM: runtimes

We compare here the runtimes of the different methods on MNIST. For a fair comparison, all the methods use the same batch size of 256 instances. We report in Table 6 averages over 10 runs. We do not use any early-stopping criterion.

Table 6 Average runtime of each model to cluster MNIST in 150 epochs

Note that our implementations of DEC, IDEC and DCN are based on that of the AE-CM. Hence, these are the most comparable. The advantage goes to the model with joint optimization, the AE-CM, which is more than 5 minutes faster. ClusterGAN is the slowest method. It also has the most complex architecture.

5.3.3 AE-CM: robustness to the number of clusters

In the previous experiments, we provided the true number of clusters to all algorithms for all datasets. In this experiment, we investigate the behavior of the AE-CM when it is set with a different number of clusters on four datasets: MNIST, USPS, R10K and Pendigit. Figure 4 shows the evolution of the ARI (left) and of the homogeneity score (Rosenberg & Hirschberg, 2007) (right). The latter measures the purity of each cluster in terms of true labels. The number of clusters varies from 5 to 20. The correct value for all datasets is 10.

Fig. 4: Adjusted Rand index and homogeneity score versus the number of clusters; the correct number of clusters is \(K=10\)

The ARI curves (left plot) reach their maximum at 10 and then decrease. This behavior is expected since this metric (as well as NMI and ACC) penalizes the number of clusters. On the other hand, the homogeneity curves (right plot) increase with K and stabilize for K larger than 10. The convergence of these curves indicates that the clustering performance of the AE-CM does not degrade if K is set larger than the ground truth. Such a result suggests that, when K is larger than the ground truth, the AE-CM finds solutions that are partitions of those found with smaller K. This phenomenon is illustrated in Appendix 3.

5.3.4 AE-CM: model selection

Similarly to the CM, we discuss here a model selection heuristic for the AE-CM. The rationale behind the use of a DAE is to have an encoding facilitating the objective of the clustering module. Hence, we propose to use the heuristic of the CM (Sect. 5.2.4). In Table 7, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and best ARI. Selected scores greater than the average are marked in boldface.

Table 7 Adjusted Rand index of the run with lowest \({\mathcal {L}}_{\text {sp}}\), the average run and the best run

Again, scores associated with the lowest \({\mathcal {L}}_{\text {sp}}\) are better than the average more than half of the time. The criterion detects the best runs of AE-CM\({}^r\) on MNIST and CIFAR10 and of AE-CM\({}^p\) on MNIST. The average absolute difference to the highest score is 6.5 and 6.6 ARI points for AE-CM\({}^r\) and AE-CM\({}^p\), respectively. In summary, although \({\mathcal {L}}_{\text {sp}}\) as a criterion does not necessarily select the best run, it filters out the worst runs.

5.3.5 AE-CM: embeddings for k-means

All baselines, including our model, are non-linear extensions of k-means and aim to improve the AE+KM. We audit the methods by running k-means on the embeddings produced by the 20 runs with random initialization on the MNIST dataset computed for Table 5 and report the average ARI, NMI and ACC in Table 8.

Table 8 Average clustering performance of k-means on different embeddings.

First, KM reports the worst results. This means that applying k-means on a feature space learned by an autoencoder does improve the quality of the clustering. Next, the results clearly show the superiority of methods utilizing a joint optimization, i.e., DKM\({}^r\)+KM and our AE-CM\({}^r\)+KM. Interestingly, the scores of DCN\({}^r\)+KM are here better than those of DCN\({}^r\). This discrepancy is certainly due to the moving average used to update the centroids.

Fig. 5: UMAP representation of a subset of MNIST and embeddings thereof learned by AE, IDEC and AE-CM. The squares indicate the centroids

We continue the analysis of the embeddings with UMAP (McInnes et al., 2018) projections. Figure 5 depicts the projections of different embeddings of the same 2000 data points. Figure 5a represents the UMAP projection from the input space. For (5b), we used the best run of AE+KM. For consistency, we used that embedding for the AE-CM\({}^p\). Hence, Fig. 5c does not show the best run of the AE-CM\({}^p\). Finally, (5d) is based on the best run of the AE-CM\({}^r\).

The projection of MNIST from the input space (5a) has two pairs of classes entangled: (3, 5) and (4, 9). The end-to-end training of the DAE (5b) successfully isolates each class except for cluster 4 (dark pink) and cluster 9 (light blue), which stay grouped together, although separable. The AE-CM\({}^p\) (5c) further contracts the clusters around the centroids found by AE+KM, but fails to separate 4 and 9. Remark that even the best run of the AE-CM\({}^p\) does not correctly split the data points. The centroids for 4 and 9 in Fig. 5b and c are in comparable positions: they align along the gap separating the true clusters. This suggests that the optimization of the AE-CM\({}^p\) did not move them much. This remark applies to the pre-trained baselines as well. Lastly, the AE-CM\({}^r\) successfully produces homogeneous groups (5d). Remark that the original entanglements of the pairs (3, 5) and (4, 9) are suggested by the trails between the corresponding clusters.

The previous observations summarize into two insights on the behavior of the AE-CM. If the AE-CM starts with an embedding associated with a low reconstruction loss for the DAE, the optimization contracts the clusters, which yields higher ARI scores. However, it is unable to move the centroids to reach another local optimum. Although the AE-CM\({}^r\) separates (4, 9), it also produces clusters more spread out than those of the AE-CM\({}^p\). The improved performance of the latter over AE+KM indicates that the AE-CM\({}^r\) would benefit from tighter groups.

5.3.6 AE-CM: sample generation and interpolation

Thanks to the reversible feature maps obtained by the DAE, both the AE-CM and its baselines (except DEC) are generative models. Figure 6 shows the decoding of the centroids of the best run of AE+KM (ARI=67.7), IDEC\({}^p\) (ARI=77.2), DKM\({}^r\) (ARI=83.6) and AE-CM\({}^r\) (ARI=88.6). AE+KM’s and IDEC\({}^p\)’s centroids for the 4 and 9 both look like 9’s. With an ARI and an ACC larger than 80 and 90, respectively, DKM\({}^r\) and AE-CM\({}^r\) both clustered the data correctly and found correct centroids for each class. Both models produce clear images for each class, which align reasonably well with the washed-out average image of the respective classes (first row).

Fig. 6: Centroids mapped back to image space for AE+KM, IDEC\({}^p\), DKM\({}^r\), and AE-CM\({}^r\). The first row displays the average image of each class

Fig. 7: Linear interpolations between different centroids (plots with border) produced with the AE-CM

Being a generative model, the AE-CM can also be used to interpolate between classes. Figure 7 shows a path made of nine interpolations between the ten centroids of the AE-CM\({}^r\). We observe smooth transitions between all the pairs, which indicate that the model learned a smooth manifold from noise (random initialization).

6 Conclusion

We presented a novel clustering algorithm that is jointly optimized with the embedding of an autoencoder to allow for nonlinear and interpretable clusterings. As a key result, we first showed that the objective function of an isotropic GMM can be turned into a loss function for autoencoders. The clustering module (CM), defined as the smallest resulting network, was shown to perform similarly to its underlying GMM in extensive empirical evaluations.

Importantly, we showed how the clustering module can be straightforwardly incorporated into deep autoencoders to allow for nonlinear clusterings. The resulting clustering network, the AE-CM, empirically outperformed existing centroid-based deep clustering architectures and performed on par with representative contemporary state-of-the-art deep clustering strategies. Nevertheless, the AE-CM, and to a lesser extent the clustering module itself, presented a greater volatility when trained from a randomly initialized network. We expect that this point could be improved by involving an annealing strategy, similarly to what is done in DKM and VIB-GMM.

A future line of work consists of extending the range of deep architectures into which the clustering module can be nested. In order to improve performance on image datasets, especially, it is necessary to involve convolutions. However, standard image-specific architectures are not structured as autoencoders. This raises the question of the robustness of our model with respect to the symmetry of the DAE, especially for applications where the computation of class representatives is not a must.

From a theoretical point of view, we believe that the derivations that led to the neural interpretation of Gaussian mixture models could benefit other mixture models, such as von Mises–Fisher mixture models (Hasnat et al., 2017) or hidden Markov models (HMM). The case of Gaussian HMMs seems especially promising as it allows a bridge to recurrent networks (Salaün et al., 2019).