1 Introduction

Given a set of items and similarities between pairs of the items, clustering is the task of partitioning the item set into groups such that the items within the same group are similar and the items within different groups are dissimilar. One natural way of representing the task is to use a graph. We construct a graph such that each item corresponds to a node and, if a pair of items has high similarity, there is an edge between them. The task is cast as one of partitioning the node set into clusters of nodes such that the nodes within the same cluster are well connected and those within different clusters are poorly connected.

Spectral clustering is a way of finding such clusters in a graph. It has two stages. The first stage embeds the nodes of a graph in real space, and the second one partitions the embedded nodes into groups. The embedding uses the eigenvectors of a matrix associated with the graph, such as the Laplacian. The grouping employs a classical clustering method such as k-means. Spectral clustering is said to date back to the works of Donath and Hoffman (1973) and Fiedler (1973) in the 1970s, and it was popularized by the works of Shi and Malik (2000), Ng et al. (2001), Bach and Jordan (2003), and von Luxburg (2007) in the machine learning and data mining community in the 2000s. Its effectiveness has been tested on various problems, and it is now recognized as a promising approach to clustering.

Recently, Peng et al. investigated the performance of a spectral clustering algorithm that uses the k-means method in the grouping stage and provided a theoretical justification as to why it works well in practice. We shall use the abbreviation KSC to refer to the k-means based spectral clustering algorithm. The results first appeared in the proceedings of COLT 2015 (Peng et al., 2015), and then in a journal paper (Peng et al., 2017). They quantified the quality of clusters in a graph in terms of a measurement, conductance. Clusters with minimal conductance fit the aim of the clustering task. They showed that the output of KSC is a good approximation to such clusters if the input graph is well-clustered. Later, their results were improved by Kolev and Mehlhorn (2016, 2018).

A little further explanation of the results may be needed. Let G be a graph with a node set V. We call a subset S of V a cluster. We call a family of k clusters \(S_1, \ldots , S_k\) a k-way partition of G if \(S_i \cap S_j = \emptyset\) for different i and j and \(S_1 \cup \cdots \cup S_k = V\). The conductance \(\phi (S_i)\) of a cluster \(S_i\) is defined to be the cut size between \(S_i\) and its complement divided by the volume of \(S_i\). We formulate the clustering task on G as the problem of finding a partition \(\{S_1, \ldots , S_k\}\) of G that minimizes the maximum of \(\phi (S_1), \ldots , \phi (S_k)\). The k-way conductance \(\phi _k(G)\) of a graph G is defined to be the minimum value. We say that a partition \(\{S_1, \ldots , S_k\}\) of G is optimal if it satisfies \(\phi _k(G) = \max \{\phi (S_1), \ldots , \phi (S_k) \}\). Finding an optimal partition of G is known to be NP-hard (Matula & Shahrokhi, 1990). Let \(\lambda _{k+1}\) denote the \((k+1)\)th smallest eigenvalue of the normalized Laplacian of G. A graph G is called well-clustered if there is a large gap between \(\phi _k(G)\) and \(\lambda _{k+1}\); in other words, \(\Upsilon = \lambda _{k+1} / \phi _k(G)\) is large. In fact, it was shown by Gharan and Trevisan (2014) that, if the gap assumption holds, there are k clusters in G such that the nodes within the same cluster are well connected and those within different clusters are poorly connected. Peng et al. measured how well the output of KSC approximates an optimal k-way partition of G and gave approximation guarantees under the assumption that G satisfies \(\Upsilon = \Omega (k^3)\). Later, Kolev and Mehlhorn improved the approximation guarantees under a weaker assumption.

1.1 Our contributions

We present a spectral clustering algorithm that uses convex programming in the grouping stage and call the algorithm ELLI since an ellipsoid plays an important role in it. It is built on an extension of the structure theorem shown in Mizutani (2020); the theorem was originally developed by Peng et al. (2017) and was later extended in Mizutani (2020). The extension implies that, if a graph is well-clustered, then the nodes with the largest degree in each cluster can be found by computing an enclosing ellipsoid for the nodes embedded in real space, and the clusters can be identified by using those nodes. ELLI is designed on the basis of this observation. We examine the performance of ELLI from theoretical and practical perspectives. The main contributions of this study are summarized as follows.

  • We provide a theoretical analysis of the clustering performance of ELLI. In Sect. 3.3, we show in Theorem 2 that, if \(\Upsilon\) exceeds some threshold, ELLI returns an optimal k-way partition of a graph. In contrast, no matter how large \(\Upsilon\) is, the result of Peng et al. does not ensure that KSC does so. As we saw above, the threshold of Peng et al. for KSC depends only on k, while our threshold for ELLI depends on the graph as well as k. Thus, in Corollary 1, we rewrite our threshold using only k and examine how large it can be.

  • We reveal that algorithms for computing nonnegative matrix factorizations (NMFs) under the separability condition are useful in the grouping stage of spectral clustering. In Sect. 5, we explain that, if a graph is well-clustered, the algorithms can exactly find the nodes with the largest degree in each cluster. This perspective has not been considered so far and provides insight into the design of effective algorithms for the grouping stage. We develop ELLI by using an ellipsoidal rounding technique that was developed in Mizutani (2014) for computing separable NMFs.

  • We present an experimental assessment of ELLI. We experimentally tested the effectiveness of ELLI at clustering real data, i.e., image datasets whose images had been categorized into classes by human judges. We applied ELLI to each dataset and evaluated how well the clusters found by it matched the classes of the dataset. For the evaluation, we used two measures, accuracy (AC) and normalized mutual information (NMI), which are commonly used for this purpose. The experiments also evaluated the conductance of the clusters found by ELLI. We tested two more clustering algorithms, one of which was KSC. A standard implementation of spectral clustering uses the k-means method based on Lloyd’s algorithm (Lloyd, 1982). Our implementation of KSC used the k-means\(++\) algorithm (Arthur & Vassilvitskii, 2007), which is an enhancement of it. Since k-means\(++\) is probabilistic, we repeated KSC equipped with it multiple times and took the average of the measurements for the evaluation of the outputs. The experiments revealed that the AC and NMI of ELLI reach at least the average AC and NMI of KSC. The experiments also showed that the conductance of the clusters found by ELLI is often smaller than the value given by KSC.

The rest of this paper is organized as follows. Section 2 explains the notation, symbols, and terminology of graphs, which will be used in the subsequent discussion. It also reviews basic results from spectral graph theory and the spectral clustering algorithm. Section 3 explains the details of ELLI and shows the theoretical performance in Theorem 2. Lastly, it describes related work, including the studies by Peng et al. and Kolev and Mehlhorn on the performance of KSC. Section 4 provides an analysis of ELLI. Section 5 reviews NMFs and the separability condition. It then explains why algorithms for computing NMFs under the separability condition can be used in the grouping stage of spectral clustering. Section 6 describes the experimental study.

Notation for vectors and matrices. The symbols \(\Vert \cdot \Vert _1, \Vert \cdot \Vert _2\) and \(\Vert \cdot \Vert _{\infty }\) denote the \(\ell _1, \ell _2\) and infinity norms of a vector or a matrix. The symbol \(\Vert \cdot \Vert _F\) denotes the Frobenius norm of a matrix. For real numbers \(a_1, \ldots , a_n\), we use \(\text{ diag }(a_1, \ldots , a_n)\) to denote an \(n \times n\) diagonal matrix having \(a_i\) in the (i, i)th entry. We use \(\varvec{e}_i\) to denote the ith unit vector and \(\varvec{I}\) to denote the identity matrix.

Remark 1

The previous version of this paper posted on arXiv in 2018 was reviewed by anonymous reviewers. Based on their comments and suggestions, we made major revisions in the current paper. In particular, we added Corollary 1, which was suggested by one of the reviewers; the best and worst results of the algorithms we tested in Table 2; the results for the case where the neighbor size p was set to 10 in Fig. 4; and the results for the ETL and MNIST datasets in Fig. 4.

2 Preliminaries

2.1 Graphs and Laplacians

Let \(G = (V, E)\) be an undirected graph, where V is the set of n nodes \(1, \ldots , n\) and E is the set of edges. We put a weight on each pair of nodes through the function \(w: V \times V \rightarrow {\mathbb{R}}_{+}\). Here, the symbol \({\mathbb{R}}_{+}\) denotes the set of nonnegative real numbers. The function w should have the following properties. For any pair of nodes \(u, v \in V\), \(w(u, v) = w(v, u)\) and \(w(u, v) > 0\) if \(\{u, v\} \in E\); otherwise, \(w(u, v) = 0\). We call a function w having the properties above a weight function on G. The degree \(d_u\) of node \(u \in V\) is given as \(d_u = \sum _{v \in V} w(u,v)\). Throughout this paper, we always regard a graph G as an undirected one with n nodes \(1, \ldots , n\) and a weight function w and assume that every node of G has a positive degree.

Let us explain the notation, symbols, and terminology that will be used in this paper. Let \(G=(V,E)\) be a graph. A cluster S in G is a subset of the node set V. A k-way partition of G is a family of k clusters \(S_1, \ldots , S_k\) that satisfy \(S_i \cap S_j = \emptyset\) for different i and j and \(S_1 \cup \cdots \cup S_k = V\). For simplicity, we sometimes call it a partition of G or a graph partition. The symbol \(\Gamma\) is used to denote a k-way partition \(\{S_1, \ldots , S_k\}\) of G. The symbol \(n_i\) is used to denote the number of nodes in \(S_i\). Let \(\{S_1, \ldots , S_k\}\) be a partition of G and u be the node that belongs to \(S_i\) and has the jth smallest degree among all nodes in \(S_i\). We use (i, j) to refer to the node u and \(d_{i,j}\) to refer to the degree \(d_u\). Note that the notation depends on the choice of k-way partition \(\Gamma\). Following the above notation, the \(n_i\) nodes in \(S_i\) are expressed as \((i,1), \ldots , (i,n_i)\), and the degree \(d_{i,j}\) of each node (i, j) satisfies \(d_{i,1} \le \cdots \le d_{i,n_i}\). The node \((i, n_i)\) belongs to \(S_i\) and has the largest degree among all nodes in \(S_i\). We call \((i, n_i)\) the representative node of the cluster \(S_i\) and the set \(\{ (1,n_1), \ldots , (k, n_k) \}\) the representative node set of the k-way partition \(\Gamma = \{S_1, \ldots , S_k\}\).

Next let us review some basic results from spectral graph theory. The adjacency matrix \(\varvec{W}\) is an \(n \times n\) symmetric matrix such that the (u, v)th entry stores the weight w(u, v) of the pair of nodes \(u, v \in V\). The degree matrix \(\varvec{D}\) is an \(n \times n\) diagonal matrix such that the (u, u)th entry stores the degree \(d_u\) of node \(u \in V\). The Laplacian \(\varvec{L}\) of G is given as \(\varvec{L}= \varvec{D}- \varvec{W}\), and the normalized Laplacian \(\mathcal {L}\) is given as \(\mathcal {L}= \varvec{D}^{-1/2} \varvec{L}\varvec{D}^{-1/2}\), which is equivalent to \(\varvec{I}- \varvec{D}^{-1/2} \varvec{W}\varvec{D}^{-1/2}\). The eigenvalues and eigenvectors of the normalized Laplacian \(\mathcal {L}\) will play an important role in our discussion. Since \(\mathcal {L}\) is an \(n \times n\) real symmetric matrix, the n eigenvalues are real and the n eigenvectors can be chosen to form an orthonormal basis of \({\mathbb{R}}^n\). Furthermore, an easy calculation shows that \(\mathcal {L}\) is positive semidefinite. Hence, all the eigenvalues are nonnegative. The smallest eigenvalue is zero, since \(\mathcal {L}\cdot (\varvec{D}^{1/2}\varvec{1}) = \varvec{0}\), where the symbol \(\varvec{1}\) denotes a vector of all ones. In addition, the largest eigenvalue is at most two. The multiplicity of the zero eigenvalue equals the number of connected components of G. The above are basic results from spectral graph theory; for details, see Chung (1997); von Luxburg (2007). In this paper, we will always use the symbols \(\lambda _1, \ldots , \lambda _n\) to denote the eigenvalues of \(\mathcal {L}\) arranged in nondecreasing order, i.e., \(0 = \lambda _1 \le \cdots \le \lambda _n \le 2\). Moreover, we will always choose the eigenvectors of \(\mathcal {L}\) to be orthonormal and use the symbol \(\varvec{f}_i\) to denote the eigenvector corresponding to the ith smallest eigenvalue \(\lambda _i\).
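As a concrete illustration of these definitions, the following Python sketch (assuming NumPy is available; the small example graph and all variable names are ours and purely illustrative) builds \(\varvec{W}\), \(\varvec{D}\), \(\varvec{L}\), and \(\mathcal {L}\) and checks the eigenvalue facts stated above.

```python
import numpy as np

# Two triangles joined by a single edge; a hypothetical unweighted example graph.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

d = W.sum(axis=1)                        # degrees d_u = sum_v w(u, v), all assumed positive
D = np.diag(d)                           # degree matrix
L = D - W                                # Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_norm = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian

eigvals = np.sort(np.linalg.eigvalsh(L_norm))
print(eigvals)                           # nonnegative, the smallest is 0, the largest does not exceed 2
print(L_norm @ np.sqrt(d))               # approximately the zero vector: D^{1/2} 1 is an eigenvector for 0
```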

2.2 Conductance

Let \(G = (V,E)\) be a graph. Let S be a cluster in G. The conductance of a cluster S is defined to be

$$\begin{aligned} \phi (S) := \frac{w(S, V {\setminus } S)}{\mu (S)} \end{aligned}$$
(1)

by letting

$$\begin{aligned} \mu (S) := \sum _{u \in S} d_{u} \quad \text{ and } \quad w(S, V {\setminus } S) := \sum _{u \in S} \sum _{v \in V {\setminus } S} w(u, v). \end{aligned}$$

Here, \(\mu (S)\) is the volume of S, and \(w(S, V {\setminus } S)\) is the cut size between S and its complement \(V {\setminus } S\). We can see from the definition of \(\phi (S)\) that clusters with low conductance capture the notion of good clusters in G, wherein nodes within the same cluster have high weights and nodes within different clusters have low weights. Kannan et al. (2004) suggested that conductance is an effective way of quantifying the quality of clusters in G.
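As an illustration of (1), the following Python sketch (NumPy assumed; the function name and the toy graph are ours) computes the volume \(\mu (S)\), the cut size \(w(S, V {\setminus } S)\), and the conductance \(\phi (S)\) of a given cluster.

```python
import numpy as np

def conductance(W, S):
    """Return phi(S) = w(S, V \\ S) / mu(S) for a cluster S given as a list of node indices."""
    n = W.shape[0]
    S = np.asarray(S)
    comp = np.setdiff1d(np.arange(n), S)     # V \ S
    degrees = W.sum(axis=1)                  # d_u = sum_v w(u, v)
    mu = degrees[S].sum()                    # volume of S
    cut = W[np.ix_(S, comp)].sum()           # cut size between S and its complement
    return cut / mu

# Two triangles joined by a single edge; the cluster {0, 1, 2} cuts only the bridging edge.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(conductance(W, [0, 1, 2]))             # 1/7, since mu(S) = 7 and the cut size is 1
```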

The conductance problem asks one to find a k-way partition \(\{S_1, \ldots , S_k\}\) of G that minimizes the maximum of \(\phi (S_1), \ldots , \phi (S_k)\). The k-way conductance of a graph G is defined to be the minimum value, and we use the symbol \(\phi _k(G)\) to denote it. That is,

$$\begin{aligned} \phi _k(G) = \min _{\{S_1, \ldots , S_k\}} \max \{ \phi (S_1), \ldots , \phi (S_k)\} \end{aligned}$$

and the minimum is taken over all candidates of k-way partitions of G. We say that a k-way partition \(\{S_1, \ldots , S_k\}\) of G is optimal if it satisfies \(\phi _k(G) = \max \{ \phi (S_1), \ldots , \phi (S_k)\}\).

Finding an optimal k-way partition of G is intractable; it is known to be NP-hard even if \(k=2\); see Matula and Shahrokhi (1990). There are approximation algorithms for \(k=2\), and in particular, the SDP-based algorithm of Arora et al. (2009) achieves an \(O(\sqrt{\log n})\)-approximation ratio. The Cheeger inequality bounds \(\phi _2(G)\) by using the second smallest eigenvalue \(\lambda _2\) of the normalized Laplacian \(\mathcal {L}\) of G. The bound was improved by Kwok et al. (2013). Regarding the general case, Lee et al. (2012) developed a higher-order Cheeger inequality. It bounds \(\phi _k(G)\) by using the kth smallest eigenvalue \(\lambda _k\) of \(\mathcal {L}\).

Peng et al. (2017) examined the performance of KSC for a class of graphs, called well-clustered graphs. For a graph G, we define

$$\begin{aligned} \Upsilon := \frac{\lambda _{k+1}}{\phi _k(G)} \end{aligned}$$

for the k-way conductance \(\phi _k(G)\) and the \((k+1)\)th smallest eigenvalue \(\lambda _{k+1}\) of the normalized Laplacian \(\mathcal {L}\). A graph G is called well-clustered if \(\Upsilon\) is large. Let us see why G is well-clustered in that case. The higher-order Cheeger inequality implies that, if \(\Upsilon\) is large, so is \(\lambda _{k+1} / \lambda _k\). We recall the result of Gharan and Trevisan (2014), who studied graphs with a gap between \(\lambda _k\) and \(\lambda _{k+1}\). Let S be a subset of the node set of G, i.e., a cluster. The outside conductance of S is defined to be \(\phi (S)\). The inside conductance of S is defined to be \(\phi _2(G[S])\), which is the two-way conductance of the subgraph G[S] induced by S. Let \(\Gamma = \{S_1, \ldots , S_k\}\) be a k-way partition of G. Following the terminology of Gharan and Trevisan, we say that \(\Gamma\) is a \((\phi _{\text{ in }}, \phi _{\text{ out }})\)-clustering if \(\phi _2(G[S_i]) \ge \phi _{\text{ in }}\) and \(\phi (S_i) \le \phi _{\text{ out }}\) for \(i=1, \ldots , k\). They showed in Corollary 1.1 of Gharan and Trevisan (2014) that, if there is a large gap between \(\lambda _k\) and \(\lambda _{k+1}\), there is a \(( \Omega ( \lambda _{k+1} / k ), O( k^3 \sqrt{\lambda _k}) )\)-clustering. Thus, we could say that G is well-clustered if \(\Upsilon\) is large. A similar observation can be found in Section 1 of Kolev and Mehlhorn (2016).

Kolev and Mehlhorn (2016) used a measurement \(\Psi\) different from \(\Upsilon\) for analyzing the performance of KSC. We denote by U the set of all optimal k-way partitions of G. Let

$$\begin{aligned} \bar{\phi }_k(G) = \min _{\{S_1, \ldots , S_k\}\in U} \frac{1}{k} \left( \phi (S_1) + \cdots + \phi (S_k) \right) . \end{aligned}$$

In analogy with \(\Upsilon\), define

$$\begin{aligned} \Psi := \frac{\lambda _{k+1}}{\bar{\phi }_k(G)}. \end{aligned}$$

Since \(\bar{\phi }_k(G) \le \phi _k(G)\), we have \(\Psi \ge \Upsilon\).

2.3 Spectral clustering algorithm

Here, we describe the framework of the spectral clustering algorithm. The input is the normalized Laplacian \(\mathcal {L}\) of a graph \(G = (V, E)\) and the number k of clusters the user desires.

  1. (Embedding stage) Compute the bottom k eigenvectors \(\varvec{f}_1, \ldots , \varvec{f}_k \in {\mathbb{R}}^n\) of \(\mathcal {L}\) and construct the spectral embedding map \(F : V \rightarrow {\mathbb{R}}^k\) using them. Apply F to the nodes \(1, \ldots , n\) of G and form a set X of points \(F(1), \ldots , F(n) \in {\mathbb{R}}^k\).

  2. (Grouping stage) Find a k-way partition \(\{X_1, \ldots , X_k\}\) of X using a clustering algorithm the user prefers. Return \(\{T_1, \ldots , T_k\}\) by letting \(T_i = \{u : F(u) \in X_i\}\) for \(i=1, \ldots , k\).

The embedding stage constructs a spectral embedding map, which is defined as follows. Let \(\varvec{P}= [\varvec{f}_1, \ldots , \varvec{f}_k]^\top \in {\mathbb{R}}^{k \times n}\) for the bottom k eigenvectors \(\varvec{f}_1, \ldots , \varvec{f}_k\) of \(\mathcal {L}\), and let \(\varvec{p}_u\) denote the uth column of \(\varvec{P}\). The spectral embedding map is the map \(F : V \rightarrow {\mathbb{R}}^k\) defined by

$$\begin{aligned} F(u) = s_u \cdot \varvec{p}_u \end{aligned}$$
(2)

for a scaling factor \(s_u \in {\mathbb{R}}\). The scaling factor is often set as \(s_u = 1 / \sqrt{d_u}\) for the degree \(d_u\) of node u or \(s_u = 1 / \Vert \varvec{p}_u\Vert _2\). The former was proposed by Shi and Malik (2000) and the latter by Ng et al. (2001).
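The embedding stage can be sketched as follows in Python (NumPy assumed; the function is our own illustration and not taken from the literature). It computes the bottom k eigenvectors of \(\mathcal {L}\), forms the columns \(\varvec{p}_u\), and applies one of the two scaling factors in (2).

```python
import numpy as np

def spectral_embedding(W, k, scaling="shi-malik"):
    """Return an n x k array whose uth row is F(u) = s_u * p_u."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L_norm)        # eigh sorts the eigenvalues in ascending order
    P = eigvecs[:, :k]                         # the uth row is p_u, i.e., the uth column of [f_1, ..., f_k]^T
    if scaling == "shi-malik":
        s = 1.0 / np.sqrt(d)                   # s_u = 1 / sqrt(d_u)   (Shi & Malik, 2000)
    else:
        s = 1.0 / np.linalg.norm(P, axis=1)    # s_u = 1 / ||p_u||_2   (Ng et al., 2001)
    return P * s[:, None]
```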

It is standard practice to use the k-means method based on Lloyd’s algorithm (Lloyd, 1982) in the grouping stage (this was suggested in Ng et al. (2001); von Luxburg (2007)). We quickly review the k-means method here. Let \(\varvec{p}_1, \ldots , \varvec{p}_n\) be points in \({\mathbb{R}}^k\). We arbitrarily choose a k-way partition \(\{S_1, \ldots , S_k\}\) of the set \(S = \{\varvec{p}_1, \ldots , \varvec{p}_n\}\). As in the case of a k-way partition of a graph, we use the symbol \(\Gamma\) to refer to it. The clustering cost function f is given by

$$\begin{aligned} f(\Gamma ) = \min _{\varvec{c}_1, \ldots , \varvec{c}_k \in {\mathbb{R}}^k} \sum _{i=1}^{k} \sum _{u \in S_i} \Vert \varvec{p}_u - \varvec{c}_i \Vert _2^2. \end{aligned}$$
(3)

The k-means method chooses a k-way partition \(\Gamma\) of S to minimize the clustering cost function \(f(\Gamma )\). Finding \(\Gamma\) that minimizes f is shown to be NP-hard in Aloise et al. (2009); Mahajan et al. (2009). Lloyd’s algorithm approximately solves the minimization problem. It starts by arbitrarily choosing \(\varvec{c}_1, \ldots , \varvec{c}_k\) as initial seeds and then minimizes \(f(\Gamma )\) by alternately fixing either \(\varvec{c}_1, \ldots , \varvec{c}_k\) or \(S_1, \ldots , S_k\). This works well in practice. The k-means++ algorithm presented in Arthur and Vassilvitskii (2007) provides a smart choice of initial seeds. It chooses \(\varvec{c}_1\) uniformly at random from the set of data points and then chooses \(\varvec{c}_{i+1}\) from the set according to a probability distribution determined by the choice of \(\varvec{c}_1, \ldots , \varvec{c}_i\).
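The seeding rule of k-means\(++\) can be sketched as follows (Python with NumPy assumed; the function is our own illustration): after the first seed is drawn uniformly at random, each subsequent seed is drawn with probability proportional to the squared distance to the nearest seed chosen so far.

```python
import numpy as np

def kmeans_pp_seeds(points, k, seed=0):
    """Return k initial centers chosen from the rows of `points` by k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]            # c_1: drawn uniformly from the data points
    for _ in range(1, k):
        dist_sq = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0
        )                                          # squared distance to the nearest chosen center
        probs = dist_sq / dist_sq.sum()
        centers.append(points[rng.choice(n, p=probs)])
    return np.array(centers)
```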

3 Algorithm and analysis results

3.1 Outline of ELLI

Peng et al. developed the structure theorem (Theorem 3.1 of Peng et al. (2017)) for analyzing the performance of KSC. Later, the theorem was extended in Mizutani (2020). ELLI is built upon the extension of the structure theorem. Let us recall it here. For a k-way partition \(\{S_1, \ldots , S_k\}\) of a graph, we define the indicator \(\varvec{g}_i \in {\mathbb{R}}^n\) of \(S_i\) to be the vector whose jth element is one if \(j \in S_i\) and zero otherwise. The normalized indicator \(\bar{\varvec{g}}_i \in {\mathbb{R}}^n\) of \(S_i\) is given as

$$\begin{aligned} \bar{\varvec{g}}_i = \frac{\varvec{D}^{1/2}\varvec{g}_i}{\Vert \varvec{D}^{1/2}\varvec{g}_i \Vert _2} \end{aligned}$$

for the degree matrix \(\varvec{D}\) of the graph. Note that \(\Vert \varvec{D}^{1/2}\varvec{g}_i \Vert _2\) is equal to \(\sqrt{\mu (S_i)}\).
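For concreteness, the normalized indicators can be formed as follows (a Python sketch with NumPy assumed; the function name is ours). The norm in the denominator is \(\sqrt{\mu (S_i)}\), as noted above.

```python
import numpy as np

def normalized_indicators(W, partition):
    """Return the n x k matrix whose ith column is gbar_i; the partition is a list of index lists."""
    n = W.shape[0]
    d = W.sum(axis=1)
    G_bar = np.zeros((n, len(partition)))
    for i, S_i in enumerate(partition):
        g = np.zeros(n)
        g[S_i] = 1.0                          # indicator g_i of S_i
        v = np.sqrt(d) * g                    # D^{1/2} g_i
        G_bar[:, i] = v / np.linalg.norm(v)   # divide by sqrt(mu(S_i))
    return G_bar
```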

Theorem 1

(Corollary 1 of Mizutani (2020)) Let a graph G satisfy \(\Upsilon > 0\). Let a k-way partition \(\{S_1, \ldots , S_k\}\) of G be optimal. Form \(\bar{\varvec{G}} = [\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k] \in {\mathbb{R}}^{n \times k}\) for the normalized indicators \(\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k\) of \(S_1, \ldots , S_k\), and form \(\varvec{F}= [\varvec{f}_1, \ldots , \varvec{f}_k] \in {\mathbb{R}}^{n \times k}\) for the bottom k eigenvectors \(\varvec{f}_1, \ldots , \varvec{f}_k\) of the normalized Laplacian of G. Then, there is some \(k \times k\) orthogonal matrix \(\varvec{U}\) such that

$$\begin{aligned} \Vert \varvec{F}\varvec{U}- \bar{\varvec{G}} \Vert _2 \le k / \Upsilon + \sqrt{k / \Upsilon }. \end{aligned}$$

Since the relation \(\sqrt{x} \ge x\) holds for \(0 \le x \le 1\), the theorem implies that \(\Vert \varvec{F}\varvec{U}- \bar{\varvec{G}} \Vert _2 \le 2 \sqrt{k / \Upsilon }\) if \(\Upsilon \ge k\). A spectral clustering algorithm maps the nodes \(1, \ldots , n\) of a graph onto the points \(F(1), \ldots , F(n)\) in \({\mathbb{R}}^k\) by using the spectral embedding map F and then partitions \(F(1), \ldots , F(n)\) into k groups. Theorem 1 tells us how \(F(1), \ldots , F(n)\) are located in \({\mathbb{R}}^k\). Let \(\varvec{P}= \varvec{F}^\top \in {\mathbb{R}}^{k \times n}\), and \(\varvec{p}_u\) denote the uth column of \(\varvec{P}\). In the same way, let \(\varvec{Q}= \bar{\varvec{G}}^\top \in {\mathbb{R}}^{k \times n}\), and \(\varvec{q}_u\) denote the uth column of \(\varvec{Q}\). We choose the spectral embedding map defined by \(F(u) = \varvec{p}_u\). The theorem implies that \(\varvec{P}\) and \(\varvec{Q}\) are related as follows:

$$\begin{aligned} \varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\end{aligned}$$

where \(\varvec{U}\) is a \(k \times k\) orthogonal matrix and \(\varvec{R}\) is a \(k \times n\) matrix that satisfies \(\Vert \varvec{R}\Vert _2 \le 2 \sqrt{k / \Upsilon }\) if \(\Upsilon \ge k\). Since our choice of F satisfies \(F(u) = \varvec{p}_u\), the relationship shown above tells us that \(\varvec{p}_u\) is close to \(\varvec{U}\varvec{q}_u\) for \(u = 1, \ldots , n\) if \(\Upsilon\) is large.

Let us look closely at the columns of \(\varvec{P}\) and \(\varvec{Q}\). The matrix \(\varvec{Q}\) is the transpose of \(\bar{\varvec{G}} = [\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k]\) where \(\bar{\varvec{g}}_i = \varvec{D}^{1/2} \varvec{g}_i / \Vert \varvec{D}^{1/2} \varvec{g}_i\Vert _2\) for the indicator \(\varvec{g}_i\) of \(S_i\) and the degree matrix \(\varvec{D}\) of G. Hence, if the node u belongs to \(S_i\), the uth column \(\varvec{q}_u\) of \(\varvec{Q}\) is

$$\begin{aligned} \varvec{q}_u = \sqrt{\frac{d_u}{\mu (S_i)}} \varvec{e}_i. \end{aligned}$$

Here, recall the notation for describing the nodes of a graph that we introduced in Sect. 2.1. Let \(\{S_1, \ldots , S_k\}\) be a k-way partition of a graph and u be the node such that it belongs to \(S_i\) and has the jth smallest degree among all nodes in \(S_i\). The notation (ij) refers to the node u. Following this notation, let \(\varvec{p}_{i,j}\), \(\varvec{q}_{i,j}\) and \(\varvec{r}_{i,j}\) denote the uth columns \(\varvec{p}_u\), \(\varvec{q}_u\), and \(\varvec{r}_u\) of \(\varvec{P}\), \(\varvec{Q}\), and \(\varvec{R}\), respectively. The columns of \(\varvec{P}\) and \(\varvec{Q}\) can be expressed as

$$\begin{aligned} \varvec{p}_{i,j} = \alpha _{i,j} \varvec{u}_i + \varvec{r}_{i,j} \quad \text{ and } \quad \varvec{q}_{i,j} = \alpha _{i,j} \varvec{e}_i \end{aligned}$$
(4)

by using \(\alpha _{i,j}\) defined by

$$\begin{aligned} \alpha _{i,j} := \sqrt{\frac{d_{i,j}}{\mu (S_i)}} \end{aligned}$$
(5)

for \(i=1,\ldots ,k\) and \(j = 1, \ldots , n_i\). Here, \(\varvec{u}_i\) denotes the ith column of the orthogonal matrix \(\varvec{U}\). Let \(\Upsilon \ge k\). Since \(\Vert \varvec{r}_{i,j} \Vert _2 \le \Vert \varvec{R}\Vert _2 \le 2 \sqrt{k / \Upsilon }\) holds, the distance between the point \(\varvec{p}_{i,j}\) and the line spanned by \(\varvec{u}_i\) is at most \(2 \sqrt{k / \Upsilon }\). Hence, if \(\Upsilon\) is large, then \(\varvec{p}_{i,j}\) is close to the line spanned by \(\varvec{u}_i\) that is orthogonal to \(\varvec{u}_\ell\) for \(\ell \ne i\). Figure 1 illustrates the columns of \(\varvec{P}\).

Fig. 1 Illustration of the columns \(\varvec{p}_{i,j}\) of \(\varvec{P}\) in the case of \(k=2\)

The task in the grouping stage is to partition the columns \(\varvec{p}_1, \ldots , \varvec{p}_n\) of \(\varvec{P}\) into k groups. In particular, the goal is to find k groups that correspond to the k clusters \(S_1, \ldots , S_k\) in an optimal k-way partition of a graph. Based on the observations described above, we develop an algorithm for this task. In particular, we focus on the orthogonality of \(\varvec{u}_1, \ldots , \varvec{u}_k\). Assume that we have exactly one element from each of \(S_1, \ldots , S_k\). Let \(u_i\) denote the element of \(S_i\) that we have and I be the set of \(u_1, \ldots , u_k\). Our strategy is as follows. Initialize sets \(T_1, \ldots , T_k\) to be empty, and repeat the following procedure for \(v=1, \ldots , n\): find

$$\begin{aligned} i^* = \arg \max _{i = 1, \ldots , k} \varvec{p}_{u_i}^\top \varvec{p}_v \end{aligned}$$

for column \(\varvec{p}_v\), and store the column index v in \(T_{i^*}\). The obtained \(T_i\) coincides with \(S_i\) for \(i=1, \ldots , k\) if \(\varvec{R}= \varvec{0}\). This is because the value of \(\varvec{p}_{u_i}^\top \varvec{p}_v\) for \(v \in S_j\) is positive if \(i = j\); otherwise, it is zero, and hence, v is stored in \(T_j\). In Sect. 4.2, we show in Theorem 4 that \(T_i\) still coincides with \(S_i\) for \(i=1, \ldots , k\) if \(\Vert \varvec{R}\Vert _2\) is smaller than some threshold.
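The strategy can be sketched as follows in Python (NumPy assumed; the variable and function names are ours). Step 3 of ELLI applies the same rule to the normalized columns \(\bar{\varvec{p}}_u = \varvec{p}_u / \Vert \varvec{p}_u \Vert _2\), as described in Sect. 3.2.

```python
import numpy as np

def group_by_inner_products(P, reps):
    """P is k x n with columns p_1, ..., p_n; reps = [u_1, ..., u_k] are column indices.
    Return the groups T_1, ..., T_k as lists of column indices."""
    k, n = P.shape
    T = [[] for _ in range(k)]
    for v in range(n):
        scores = [P[:, u] @ P[:, v] for u in reps]   # p_{u_i}^T p_v for i = 1, ..., k
        T[int(np.argmax(scores))].append(v)          # store v in T_{i*}
    return T
```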

In the implementation of this strategy, a question arises as to how to find the set I that contains k elements belonging to each of \(S_1, \ldots , S_k\). To address this question, we leverage the ellipsoidal rounding (ER) algorithm in Mizutani (2014). That is, we compute the minimum-volume ellipsoid, centered at the origin, that contains the columns of \(\varvec{P}\) and then find all points on the boundary of the ellipsoid. ER was originally developed for solving separable NMF problems whose formal description is given in Sect. 5. In Sect. 4.1, we show in Theorem 3 that the set of obtained points exactly coincides with the representative node set of \(\{S_1, \ldots , S_k\}\) if \(\Vert \varvec{R}\Vert _2\) is smaller than some threshold.

3.2 Description of the algorithm

We describe each step of ELLI in Algorithm 1. Step 1 is the embedding stage, and Steps 2 and 3 are the grouping stage. Step 2 aims to find the representative node set of a partition of a graph G. Step 3 aims to find the partition of G by using the representative node set.

Let us explain Step 2 in detail. An ellipsoid centered at the origin in \({\mathbb{R}}^k\) is a set \(H = \{ \varvec{a}\in {\mathbb{R}}^k : \varvec{a}^\top \varvec{M}\varvec{a}\le 1\}\) for a \(k \times k\) symmetric positive definite matrix \(\varvec{M}\). The volume of H is \(v(k) / \sqrt{\det \varvec{M}}\), where v(k) is the volume of a unit ball in \({\mathbb{R}}^k\), and its value depends on k. Step 2 constructs a minimum-volume enclosing ellipsoid (MVEE) centered at the origin for the set X of points \(\varvec{p}_1, \ldots , \varvec{p}_n\) in \({\mathbb{R}}^k\). The ellipsoid can be obtained by solving an optimization problem with a symmetric matrix variable \(\varvec{X}\),

$$\begin{aligned} \begin{array}{lll} \mathsf P(S): &{} \text{ minimize } &{} - \log \det \varvec{X}, \\ &{} \text{ subject } \text{ to } &{} \varvec{p}^\top \varvec{X}\varvec{p}\le 1 \ \text{ for } \text{ all } \ \varvec{p}\in S, \\ &{} &{} \varvec{X}\succ \varvec{0}. \end{array} \end{aligned}$$

The notation \(\varvec{X}\succ \varvec{0}\) means that \(\varvec{X}\) is positive definite. The origin-centered MVEE for S, denoted by H(S), is given as \(H(S) = \{ \varvec{a}\in {\mathbb{R}}^k : \varvec{a}^\top \varvec{X}\varvec{a}\le 1\}\) for the optimal solution \(\varvec{X}\) of \(\mathsf P(S)\). We call a point \(\varvec{p}_i\) and its index i the active point and the active index of H(S), respectively, if \(\varvec{p}_i\) satisfies \(\varvec{p}_i^\top \varvec{X}\varvec{p}_i = 1\); in other words, \(\varvec{p}_i\) lies on the boundary of H(S). Step 2 may use the successive projection algorithm (SPA) of Araújo et al. (2001); Gillis and Vavasis (2014), which is usually used for solving separable NMF problems. In Sect. 5, we explain the connection between finding the representative node set of a partition of G and solving separable NMF problems.
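For illustration, problem \(\mathsf P(S)\) can be solved with an off-the-shelf convex programming package. The sketch below uses CVXPY (an assumption on our part; any solver that handles log-det objectives would do) and reports the active indices of H(S) up to a numerical tolerance of our own choosing.

```python
import numpy as np
import cvxpy as cp

def active_indices(P, tol=1e-6):
    """P is k x n with columns p_1, ..., p_n; return the active index set of H(S)."""
    k, n = P.shape
    X = cp.Variable((k, k), PSD=True)                    # the symmetric matrix variable of P(S)
    constraints = [cp.quad_form(P[:, v], X) <= 1 for v in range(n)]
    prob = cp.Problem(cp.Minimize(-cp.log_det(X)), constraints)
    prob.solve()                                         # the log-det objective keeps X positive definite
    vals = [float(P[:, v] @ X.value @ P[:, v]) for v in range(n)]
    return [v for v in range(n) if vals[v] >= 1 - tol]   # points on the boundary of H(S)
```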

Algorithm 1 ELLI

Let us examine the computational cost of Steps 2 and 3 (Step 1 is common to spectral clustering algorithms). The main cost in Step 2 is in computing the optimal solution of problem \(\mathsf P(S)\). This is a convex programming problem, and efficient algorithms exist for solving it. Khachiyan (1996) developed the Frank-Wolfe algorithm for solving the dual problem and evaluated the computational cost. Kumar and Yildirim (2005) modified the algorithm and showed that the modification returns a \((1+\epsilon )\)-approximation solution in \(O(nk^3 / \epsilon )\) time. An interior-point method within a cutting plane framework can quickly solve the problem in practice. The main cost in Step 3 is in computing \(\bar{\varvec{p}}_{u_i}^\top \bar{\varvec{p}}_v\) for \(i=1,\ldots ,k\) and \(v=1,\ldots ,n\). The computation takes \(O(nk^2)\) time.

The author presented a spectral clustering algorithm in Mizutani (2015). That algorithm shares Steps 1 and 2 with ELLI, but not Step 3. The manuscript mainly studied the similarity between algorithms for spectral clustering and separable NMFs.

3.3 Results of performance analysis

Here, we state the results of our analysis; the details are given in Sect. 4. Let \(\alpha _{i,j}\) be defined as in (5). They satisfy \(\alpha _{i,1} \le \cdots \le \alpha _{i,n_i}\). Define

$$\begin{aligned} \alpha _{\text{ min }} := \min _{\begin{array}{c} i=1, \ldots , k \\ j=1, \ldots , n_i \end{array}} \alpha _{i,j} \quad \text{ and } \quad \hat{\alpha }_{\text{ min }} := \min _{i=1, \ldots , k} \alpha _{i, n_i}. \end{aligned}$$

We can rewrite \(\alpha _{\text{ min }}\) as \(\alpha _{\text{ min }} = \min _{i=1, \ldots , k} \alpha _{i,1}\). Define

$$\begin{aligned} \theta _{i,j} := \frac{\alpha _{i,j}}{\alpha _{i,n_i}} = \sqrt{\frac{d_{i,j}}{d_{i, n_i}}} \end{aligned}$$
(6)

for \(i = 1, \ldots , k\) and \(j = 1, \ldots , n_i-1\). As is the case with \(\alpha _{i,j}\), they satisfy \(\theta _{i,1} \le \cdots \le \theta _{i, n_i-1}\). Define

$$\begin{aligned} \theta _{\text{ min }} := \min _{\begin{array}{c} i=1,\ldots ,k \\ j=1,\ldots ,n_i-1 \end{array}} \theta _{i,j} \quad \text{ and } \quad \theta _{\text{ max }} := \max _{\begin{array}{c} i=1,\ldots ,k \\ j=1, \ldots , n_i-1 \end{array}} \theta _{i,j}. \end{aligned}$$

We can rewrite them as \(\theta _{\text{ min }} = \min _{i=1,\ldots ,k}\theta _{i,1}\) and \(\theta _{\text{ max }} = \max _{i=1,\ldots ,k}\theta _{i,n_i-1}\). Now let us introduce the parameters \(\alpha\) and \(\theta\). These parameters are determined by \(\alpha _{i,j}\) and \(\theta _{i,j}\), which in turn depend on the choice of a k-way partition of the graph. Let a k-way partition \(\Gamma\) be optimal. The parameter \(\alpha\) is set as

$$\begin{aligned} \alpha = \hat{\alpha }_{\text{ min }} \end{aligned}$$

for \(\hat{\alpha }_{\text{ min }}\) determined by the optimal k-way partition \(\Gamma\). This satisfies \(\alpha > 0\), since \(d_{i,j}\) are all positive. The parameter \(\theta\) is set as

$$\begin{aligned} \theta = \min \biggl \{\frac{1}{2} (1 - \theta _{\text{ max }}), \ (17-12\sqrt{2})\theta _{\text{ min }} \biggr \} \end{aligned}$$

for \(\theta _{\text{ min }}\) and \(\theta _{\text{ max }}\) determined by the optimal k-way partition \(\Gamma\). This satisfies \(\theta \ge 0\), since \(0 < \theta _{\text{ min }}\le \theta _{\text{ max }} \le 1\). Here, \(\theta _{\text{ min }}\) is strictly greater than zero, as \(d_{i,j}\) are all positive. Theorem 2 is the main result of our analysis.
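Given a candidate k-way partition and the node degrees, the parameters introduced above and the threshold appearing in Theorem 2 below can be computed directly. The following Python sketch (NumPy assumed; the function name is ours) does so; note that the threshold is infinite when \(\theta = 0\), i.e., when \(\theta _{\text{ max }} = 1\).

```python
import numpy as np

def theorem2_threshold(degrees, partition):
    """degrees: array of d_u; partition: list of k lists of node indices.
    Return 4k / (theta * alpha)^2 for the parameters alpha and theta defined in the text."""
    k = len(partition)
    alpha_hat_min, theta_min, theta_max = np.inf, np.inf, 0.0
    for S_i in partition:
        d = np.sort(np.asarray(degrees)[S_i])        # d_{i,1} <= ... <= d_{i,n_i}
        mu = d.sum()                                 # volume of S_i
        alpha_hat_min = min(alpha_hat_min, np.sqrt(d[-1] / mu))       # alpha_{i, n_i}
        if len(d) > 1:
            theta_ij = np.sqrt(d[:-1] / d[-1])       # theta_{i,1}, ..., theta_{i, n_i - 1}
            theta_min = min(theta_min, theta_ij[0])
            theta_max = max(theta_max, theta_ij[-1])
    alpha = alpha_hat_min
    theta = min(0.5 * (1.0 - theta_max), (17.0 - 12.0 * np.sqrt(2.0)) * theta_min)
    return 4.0 * k / (theta * alpha) ** 2 if theta > 0 else np.inf
```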

Theorem 2

If a graph G satisfies

$$\begin{aligned} \Upsilon > \frac{4k}{(\theta \alpha )^2}, \end{aligned}$$

then the output of ELLI coincides with an optimal k-way partition of G.

The proof is given in Sect. 4.3. Here, let us compare Theorem 2 with the results of Peng et al. (2017) and Kolev and Mehlhorn (2016) (a detailed description of their results is given in Sect. 3.4). The results of Peng et al. tell us that the output of KSC gets closer to the optimal k-way partition of a graph as \(\Upsilon\) gets larger. However, it does not ensure that the output is exactly the optimal one no matter how large \(\Upsilon\) is. The same goes for the results of Kolev and Mehlhorn. Meanwhile, Theorem 2 ensures that the output of ELLI is the optimal one if \(\Upsilon\) exceeds some threshold. But, the threshold could be large. Let us look at this in more detail. Peng et al. showed the performance of KSC for graphs satisfying \(\Upsilon = \Omega (k^3)\). Let us compare the bound on \(\Upsilon\) in Theorem 2 with that of Peng et al. Although the bound of Theorem 2 contains \(\theta\) and \(\alpha\) as well as k, we can rewrite it using only k as in the following corollary.

Corollary 1

Let positive real numbers p, q, r satisfy the following three conditions. For every \(i=1, \ldots , k\),

  • \(d_{i, n_i-1} \le \left( 1 - \frac{1}{k^{p/2}} \right) ^2 \cdot d_{i, n_i}\),

  • \(d_{i,1} \ge \left( \frac{1}{k^q} \right) \cdot d_{i, n_i}\),

  • \(d_{i,n_i} \ge \left( \frac{1}{k^r} \right) \cdot \mu (S_i)\).

If a graph G satisfies \(\Upsilon = \Omega \left( k^{(r+1) \cdot \max \{p, q\}} \right)\), then the output of ELLI coincides with an optimal k-way partition of G.

The proof is given in Sect. 4.3. We can bound \(\mu (S_i)\) from above by \(n_i \cdot d_{i, n_i}\). This bound implies that the choice of \(r = \log _k n_i\) satisfies the third condition of Corollary 1. Hence, if \(n_i \ge k^2\) and \(\max \{p,q\} \ge 1\), we have \((r+1) \cdot \max \{p, q\} \ge 3\). Accordingly, in most cases, Theorem 2 imposes a stronger restriction on graphs compared with the results of Peng et al.

3.4 Related work

We describe the results of Peng et al. (2017) in more detail. Let a k-way partition \(\{S_1, \ldots , S_k\}\) of a graph be optimal. We choose an algorithm for minimizing the clustering cost function f shown in (3) and assume that the algorithm has an approximation ratio of \(\eta\). Let \(T_1, \ldots , T_k\) be the output of a spectral clustering algorithm that uses a k-means method based on the \(\eta\)-approximation algorithm. Peng et al. showed in Theorem 1.2 of Peng et al. (2017) that, if a graph satisfies \(\Upsilon = \Omega (k^3)\), then the following holds after a suitable renumbering of the output of the algorithm.

$$\begin{aligned} \mu (S_i \triangle T_i) = O\biggl (\frac{\eta k^3}{\Upsilon } \biggr ) \mu (S_i) \quad \text{ and } \quad \phi (T_i) = 1.1 \phi (S_i) + O\biggl (\frac{\eta k^3}{\Upsilon } \biggr ). \end{aligned}$$

Here, the notation \(S_i \triangle T_i\) denotes the symmetric difference of the sets \(S_i\) and \(T_i\), i.e., \(S_i \triangle T_i = (S_i {\setminus } T_i) \cup (T_i {\setminus } S_i)\). This result tells us that, if \(\Upsilon = \Omega (k^3)\), the output \(\{T_1, \ldots , T_k\}\) is close to the optimal k-way partition \(\{S_1, \ldots , S_k\}\) of the graph, and, in particular, the output gets closer to the optimal one as \(\Upsilon\) gets larger. Peng et al. (2017) also developed a nearly linear time algorithm for clustering by using the heat kernel of a graph and nearest neighbor data structures.

Kolev and Mehlhorn (2016) improved on the results of Peng et al. They showed in Theorem 1.2 of their paper that, if a graph satisfies \(\Psi = \Omega (k^3)\), then the following holds after a suitable renumbering of the output of the algorithm.

$$\begin{aligned} \mu (S_i \triangle T_i) = O\biggl (\frac{\eta k^2}{\Psi } \biggr ) \mu (S_i) \quad \text{ and } \quad \phi (T_i) = 1.1 \phi (S_i) + O\biggl (\frac{\eta k^2}{\Psi } \biggr ). \end{aligned}$$

The results of Kolev and Mehlhorn are an improvement over those of Peng et al. The bounds on the approximation accuracy are reduced by a factor of k and the gap assumption is weakened due to \(\Psi \ge \Upsilon\). Kolev and Mehlhorn (2016) also analyzed a spectral clustering algorithm that uses a variant of Lloyd’s algorithm, which was proposed in Ostrovsky et al. (2012), for the k-means method. The results of Kolev and Mehlhorn were further improved in Mizutani (2020).

There is a line of research that explores spectral clustering from a theoretical perspective. Spectral clustering maps the nodes of a graph onto points in real space through the spectral embedding map. Using the Davis–Kahan theorem from matrix perturbation theory, Ng et al. (2001) showed that the resulting points are nearly orthogonal. Kannan et al. (2004) introduced bicriteria to quantify the quality of clusters, where one criterion is the inside conductance of a cluster, which was explained in Sect. 2.2, and the other is the total weight of the inter-cluster edges. They assumed that a graph has clusters such that the inside conductance of the clusters is large and the total weight of the inter-cluster edges is small. They evaluated how close the output of spectral clustering is to those clusters. As we saw in Sect. 2.2, Gharan and Trevisan (2014) showed that, if there is a large gap between \(\lambda _k\) and \(\lambda _{k+1}\), there exists a k-way partition of a graph with a large inside conductance and small outside conductance. Here, recall that \(\lambda _k\) and \(\lambda _{k+1}\) are the kth and the \((k+1)\)th smallest eigenvalues of the normalized Laplacian of the graph. They also showed that, if a graph satisfies \(\lambda _{k+1} > 0\), there is a polynomial time algorithm that outputs an \(\ell\)-way partition of the graph that is a \((\Omega (\lambda _{k+1}^2 / k^4), O(k^6\sqrt{\lambda _k}) )\)-clustering, where \(1 \le \ell \le k\). Dey et al. (2019) developed a greedy algorithm that partitions the node set of a graph into clusters with a large inside conductance and small outside conductance. They showed that, if there is a large gap between \(\lambda _k\) and \(\lambda _{k+1}\), the output of the algorithm is close to a \((\Omega (\lambda _{k+1} / k), O(k^3\sqrt{\lambda _k}) )\)-clustering. Sinop (2016) studied a spectral clustering algorithm in the context of the edge expansion problem, which is related to the conductance problem, and evaluated the accuracy of its output by using a similar measurement to \(\Upsilon\).

There is also a considerable amount of research on spectral clustering on a random graph. In the planted partition model, one assumes that the node set is partitioned into several clusters and edges connecting the nodes are stochastically generated: any two nodes in the same cluster have an edge with probability p, and any two nodes in different clusters have an edge with probability q. McSherry (2001) showed that spectral clustering can identify the clusters with high probability if p and q lie within some range. Rohe et al. (2011) and Lei and Rinaldo (2015) studied KSC on a stochastic block model.

4 Analysis of the algorithm

4.1 Step 2

We analyze Step 2 of ELLI and state the result (Theorem 3). The analysis of Step 3 and the result (Theorem 4) are given in the next section. Let \(\varvec{P}\in {\mathbb{R}}^{k \times n}\) be the matrix constructed in Step 1. We take a k-way partition \(\Gamma = \{ S_1, \ldots , S_k \}\) of a graph and construct \(\varvec{Q}= [\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k]^\top \in {\mathbb{R}}^{k \times n}\) for the normalized indicators \(\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k\) of \(S_1, \ldots , S_k\). Choosing a \(k \times k\) orthogonal matrix \(\varvec{U}\), we express \(\varvec{P}\) as

$$\begin{aligned} \varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\end{aligned}$$
(7)

where \(\varvec{R}\) is a \(k \times n\) matrix and serves as the residual between \(\varvec{P}\) and \(\varvec{U}\varvec{Q}\). In the analysis of Steps 2 and 3, we use this expression for \(\varvec{P}\). As explained in Sect. 3.1, if \(\Gamma\) is optimal and \(\Upsilon \ge k\) holds, there is some orthogonal matrix \(\varvec{U}\) such that \(\varvec{R}\) satisfies \(\Vert \varvec{R}\Vert _2 \le 2 \sqrt{k / \Upsilon }\). It should be noted that we do not require \(\Gamma\) to be optimal in Theorems 3 and 4. We will use the result presented in Mizutani (2014) for analyzing Step 2.

Proposition 1

(Corollary 4 of Mizutani (2014)) Let \(\varvec{a}_1, \ldots , \varvec{a}_k, \varvec{b}_1, \ldots , \varvec{b}_m\) be points in \({\mathbb{R}}^k,\) and let \(\varvec{M}= [\varvec{a}_1, \ldots , \varvec{a}_k] \in {\mathbb{R}}^{k \times k}\). Assume that the points satisfy the following conditions.

  • \(\varvec{M}\) is nonsingular.

  • For any \(\varvec{b}\in \{\varvec{b}_1, \ldots , \varvec{b}_m\}\), there exists some vector \(\varvec{c}\in {\mathbb{R}}^k\) such that \(\varvec{b}= \varvec{M}\varvec{c}\) and \(\Vert \varvec{c}\Vert _2 < 1\).

Let H(S) be an origin-centered MVEE for the set S of points \(\varvec{a}_1, \ldots , \varvec{a}_k, \varvec{b}_1, \ldots , \varvec{b}_m\). Then, the active points of H(S) are \(\varvec{a}_1, \ldots , \varvec{a}_k\).

Using the Karush–Kuhn–Tucker (KKT) conditions, we can easily check the correctness of this assertion. For the set S of the points \(\varvec{a}_1, \ldots , \varvec{a}_k, \varvec{b}_1, \ldots , \varvec{b}_m\), these conditions imply that \((\varvec{M}\varvec{M}^\top )^{-1}\) is an optimal solution for problem \(\mathsf P(S)\). We have \(\varvec{a}_i^\top (\varvec{M}\varvec{M}^\top )^{-1} \varvec{a}_i = 1\) and, for any \(\varvec{b}\in \{\varvec{b}_1, \ldots , \varvec{b}_m\}\),

$$\begin{aligned} \varvec{b}^\top (\varvec{M}\varvec{M}^\top )^{-1} \varvec{b}= \varvec{c}^\top \varvec{M}^\top (\varvec{M}^\top )^{-1} \varvec{M}^{-1} \varvec{M}\varvec{c}= \Vert \varvec{c}\Vert _2^2 < 1. \end{aligned}$$

Hence, the active points of the origin-centered MVEE for S are \(\varvec{a}_1, \ldots , \varvec{a}_k\).

Theorem 3

Let \(\varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\) be of the form shown in (7). Let \(\Gamma\) be the k-way partition of a graph corresponding to \(\varvec{Q}\). Let H(S) be an origin-centered MVEE for the set of all columns of \(\varvec{P}\). If

$$\begin{aligned} \Vert \varvec{R}\Vert _2 < \frac{1}{2} (1-\theta _{\text{ max }})\hat{\alpha }_{\text{ min }}, \end{aligned}$$

then the active index set of H(S) coincides with the representative node set of \(\Gamma\).

Recall that \(\alpha _{i,j}\) and \(\theta _{i,j}\) are defined as in (5) and (6), respectively; in the theorem above, \(\theta _{\text{ max }}\) is the largest among \(\theta _{i,j}\) for \(i=1,\ldots , k, \ j=1, \ldots , n_i-1\), and \(\hat{\alpha }_{\text{ min }}\) is the smallest among \(\alpha _{i, n_i}\) for \(i=1, \ldots , k\). The proof of this theorem relies on the techniques used to prove Theorem 9 of Mizutani (2014), which gives the robustness of ER to noise. However, as we will see in Sect. 5, Theorem 3 does not directly follow from it. We thus present the proof here. As a lemma for proving the theorem, we use a classical result regarding singular value perturbations. For a matrix \(\varvec{A}\), the symbol \(\sigma _i(\varvec{A})\) denotes the ith smallest singular value of \(\varvec{A}\). In particular, the symbol \(\sigma _{\text{ min }}(\varvec{A})\) denotes the smallest singular value, i.e., \(\sigma _1(\varvec{A})\).

Lemma 1

(See, for instance, Corollary 8.6.2 of Golub and Loan (2013)) Let \(\varvec{A}\in {\mathbb{R}}^{k \times n}\) and \(\varvec{N}\in {\mathbb{R}}^{k \times n}\). We have

$$\begin{aligned} |\sigma _i(\varvec{A}+ \varvec{N}) - \sigma _i(\varvec{A})| \le \Vert \varvec{N}\Vert _2 \end{aligned}$$

for \(i = 1, \ldots , \ell\) where \(\ell = \min \{k, n\}\).

Let us prove Theorem 3.

Proof of Theorem 3

We use \(\varvec{p}_{i,j}\) to refer to the columns of \(\varvec{P}\). Let \(\varvec{M}= [\varvec{p}_{1,n_1}, \ldots , \varvec{p}_{k,n_k}] \in {\mathbb{R}}^{k \times k}\) and \(\varvec{p}\in {\mathbb{R}}^k\) be a vector arbitrarily chosen among the vectors \(\varvec{p}_{i,j}\) with \(j \notin \{n_1, \ldots , n_k\}\). From Proposition 1, it is sufficient to prove that \(\varvec{M}\) is nonsingular and there exists some vector \(\varvec{c}\in {\mathbb{R}}^k\) such that \(\varvec{p}= \varvec{M}\varvec{c}\) and \(\Vert \varvec{c}\Vert _2 < 1\).

We use \(\varvec{q}_{i,j}\) to refer to the columns of \(\varvec{Q}\). Since \(\varvec{q}_{i,j} = \alpha _{i,j} \varvec{e}_i\) as shown in (4), we can write \(\varvec{M}\) as

$$\begin{aligned} \varvec{M}= \varvec{U}\varvec{V}+ \varvec{R}' \end{aligned}$$

by letting \(\varvec{V}= \text{ diag }(\alpha _{1,n_1}, \ldots , \alpha _{k,n_k}) \in {\mathbb{R}}^{k \times k}\) and \(\varvec{R}' = [\varvec{r}_{1,n_1}, \ldots , \varvec{r}_{k, n_k}] \in {\mathbb{R}}^{k \times k}\). It follows from Lemma 1 that

$$\begin{aligned} |\sigma _{i}(\varvec{M}) - \sigma _i(\varvec{U}\varvec{V}) | = |\sigma _{i}(\varvec{M}) - \sigma _i(\varvec{V}) | \le \Vert \varvec{R}'\Vert _2 \le \Vert \varvec{R}\Vert _2 \end{aligned}$$

for \(i=1,\ldots ,k\). Here, the equality holds because \(\varvec{U}\) is orthogonal, and the second inequality holds because \(\varvec{R}'\) is a submatrix of \(\varvec{R}\). Hence, we have \(\sigma _{\text{ min }}(\varvec{M}) \ge \sigma _{\text{ min }}(\varvec{V}) - \Vert \varvec{R}\Vert _2 = \hat{\alpha }_{\text{ min }} - \Vert \varvec{R}\Vert _2\). From the bound on \(\Vert \varvec{R}\Vert _2\) imposed in this theorem, the inequality implies

$$\begin{aligned} \sigma _{\text{ min }}(\varvec{M}) > \frac{1}{2} (1 + \theta _{\text{ max }}) \hat{\alpha }_{\text{ min }}. \end{aligned}$$
(8)

Since \(\hat{\alpha }_{\text{ min }}\) and \(\theta _{\text{ max }}\) are positive, so is \(\sigma _{\text{ min }}(\varvec{M})\). Accordingly, \(\varvec{M}\) is nonsingular.

Let \(\varvec{r}= \varvec{r}_{i,j} - \theta _{i,j} \varvec{r}_{i, n_i} \in {\mathbb{R}}^k\), and set a vector \(\varvec{c}\in {\mathbb{R}}^k\) as \(\varvec{c}= \theta _{i,j} \varvec{e}_i + \varvec{M}^{-1} \varvec{r}\). We have \(\varvec{p}= \varvec{p}_{i,j} = \varvec{M}\varvec{c}\), since

$$\begin{aligned} \varvec{M}\varvec{c}&= \varvec{M}(\theta _{i,j} \varvec{e}_i + \varvec{M}^{-1} \varvec{r}) \\&= \alpha _{i,j} \varvec{u}_i + \theta _{i,j} \varvec{r}_{i, n_i} + \varvec{r}\\&= \alpha _{i,j} \varvec{u}_i + \varvec{r}_{i,j} \\&= \varvec{p}_{i,j}. \end{aligned}$$

We can bound \(\Vert \varvec{c}\Vert _2\) as

$$\begin{aligned} \Vert \varvec{c}\Vert _2 = \Vert \theta _{i,j} \varvec{e}_i + \varvec{M}^{-1} \varvec{r}\Vert _2 \le \theta _{\text{ max }} + \frac{ \Vert \varvec{r}\Vert _2 }{ \sigma _{\text{ min }}(\varvec{M})}, \end{aligned}$$
(9)

and \(\Vert \varvec{r}\Vert _2\) as

$$\begin{aligned} \Vert \varvec{r}\Vert _2&= \Vert \varvec{r}_{i,j} - \theta _{i,j} \varvec{r}_{i, n_i}\Vert _2 \nonumber \\&\le (1+\theta _{\text{ max }}) \Vert \varvec{R}\Vert _2 \nonumber \\&< \frac{1}{2} (1 - \theta _{\text{ max }}^2) \hat{\alpha }_{\text{ min }}. \end{aligned}$$
(10)

The last inequality comes from the bound on \(\Vert \varvec{R}\Vert _2\) in this theorem. Accordingly, from inequalities (8), (9) and (10), we have \(\Vert \varvec{c}\Vert _2 < 1\).□

4.2 Step 3

Let us move on to the analysis of Step 3.

Theorem 4

Let \(\varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\) be of the form shown in (7). Let \(\Gamma = \{S_1, \ldots , S_k\}\) be the k-way partition of a graph corresponding to \(\varvec{Q}\). For the columns \(\varvec{p}_i\) of \(\varvec{P}\), take the normalized ones \(\bar{\varvec{p}}_i = \varvec{p}_i / \Vert \varvec{p}_i\Vert _2\). Assume that we have an element \(u_i\) in \(S_i\) for every \(i = 1, \ldots , k\). Pick an element v from \(S_j\). If the \(\ell\)th column \(\varvec{r}_\ell\) of \(\varvec{R}\) satisfies

$$\begin{aligned} \Vert \varvec{r}_\ell \Vert _2 < (17 - 12\sqrt{2}) \alpha _{\text{ min }} \end{aligned}$$

for \(\ell = 1,\ldots ,n\), then the chosen \(i^* = \arg \max _{i=1, \ldots , k} \bar{\varvec{p}}_{u_i}^\top \bar{\varvec{p}}_v\) satisfies \(i^* = j\).

Recall that \(\alpha _{\text{ min }}\) is the smallest among all \(\alpha _{i,j}\) for \(i=1, \ldots , k, \ j=1, \ldots , n_i\). Before going to the proof, we derive a range of \(\Vert \varvec{R}\Vert _2\) under which the conditions imposed in Theorems 3 and 4 both hold. From the definition, we have \(\alpha _{\text{ min }} = \min _{i=1,\ldots ,k}\alpha _{i,1}\). In addition, \(\alpha _{i,1}\) can be written as \(\alpha _{i,1} = (\alpha _{i,1} / \alpha _{i,n_i}) \cdot \alpha _{i,n_i} = \theta _{i,1} \alpha _{i,n_i}\). Hence, \(\alpha _{\text{ min }} \ge \theta _{\text{ min }} \hat{\alpha }_{\text{ min }}\) holds. Accordingly, if \(\Vert \varvec{R}\Vert _2 < (17-12\sqrt{2})\theta _{\text{ min }} \hat{\alpha }_{\text{ min }}\), we see that \(\Vert \varvec{r}_\ell \Vert _2\) satisfies the bound imposed in Theorem 4, since

$$\begin{aligned} \Vert \varvec{r}_\ell \Vert _2 \le \Vert \varvec{R}\Vert _2 < (17-12\sqrt{2})\theta _{\text{ min }} \hat{\alpha }_{\text{ min }} \le (17-12\sqrt{2})\alpha _{\text{ min }} \end{aligned}$$

for \(\ell = 1,\ldots ,n\). Consequently, Theorems 3 and 4 imply that we can identify a graph partition from the columns of \(\varvec{P}\), if

$$\begin{aligned} \Vert \varvec{R}\Vert _2 < \hat{\alpha }_{\text{ min }} \cdot \min \biggl \{ \frac{1}{2} (1-\theta _{\text{ max }}), \ (17-12\sqrt{2})\theta _{\text{ min }} \biggr \}. \end{aligned}$$
(11)

Now let us prove Theorem 4. First, we build Propositions 2 and 3. In what follows, we use \(\varvec{p}_{i,j}\) to refer to the columns of \(\varvec{P}\). Let

$$\begin{aligned} \varvec{z}_{i, j} := \varvec{U}^\top \varvec{p}_{i,j} \in {\mathbb{R}}^k \quad \text{ and } \quad \bar{\varvec{z}}_{i,j} := \frac{\varvec{z}_{i,j}}{\Vert \varvec{z}_{i,j}\Vert _2} \in {\mathbb{R}}^k. \end{aligned}$$

Since \(\varvec{p}_{i,j} = \alpha _{i,j}\varvec{u}_i + \varvec{r}_{i,j}\), we can express \(\varvec{z}_{i, j}\) as

$$\begin{aligned} \varvec{z}_{i,j} = \alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j} \end{aligned}$$

by letting \(\varvec{n}_{i,j} := \varvec{U}^\top \varvec{r}_{i,j} \in {\mathbb{R}}^k\). Since \(\varvec{U}\) is orthogonal, \(\bar{\varvec{p}}_{i,j}^\top \bar{\varvec{p}}_{u,v} = \bar{\varvec{z}}_{i,j}^\top \bar{\varvec{z}}_{u,v}\) and \(\Vert \varvec{n}_{i,j}\Vert _2 = \Vert \varvec{r}_{i,j}\Vert _2\). We will examine the inner product of \(\bar{\varvec{z}}_{i,j}\) and \(\bar{\varvec{z}}_{u,v}\) instead of that of \(\bar{\varvec{p}}_{i,j}\) and \(\bar{\varvec{p}}_{u,v}\). In the proposition below, we use \((\varvec{a})_i\) to denote the ith element of the vector \(\varvec{a}\).

Proposition 2

Let \(\bar{\varvec{z}}_{i,j}\) be defined as above. The ith element is bounded from below:

$$\begin{aligned} (\bar{\varvec{z}}_{i,j})_i \ge \frac{\alpha _{\text{ min }} - \Vert \varvec{n}_{i,j}\Vert _2 }{\alpha _{\text{ min }} + \Vert \varvec{n}_{i,j}\Vert _2 }. \end{aligned}$$

Proof

From the definition,

$$\begin{aligned} \bar{\varvec{z}}_{i,j} = \frac{\varvec{z}_{i,j}}{\Vert \varvec{z}_{i,j}\Vert _2} = \frac{\alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j}}{\Vert \alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j}\Vert _2}. \end{aligned}$$

Hence, the ith element of \(\bar{\varvec{z}}_{i,j}\) is given as

$$\begin{aligned} (\bar{\varvec{z}}_{i,j})_i = \frac{\alpha _{i,j} + (\varvec{n}_{i,j})_i}{\Vert \alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j}\Vert _2}. \end{aligned}$$

Since \(|(\varvec{n}_{i,j})_i| \le \Vert \varvec{n}_{i,j}\Vert _{\infty } \le \Vert \varvec{n}_{i,j}\Vert _2\) holds, we have, for the numerator, \(\alpha _{i,j} + (\varvec{n}_{i,j})_i \ge \alpha _{i,j} - \Vert \varvec{n}_{i,j}\Vert _2\). Also, we have, for the square of the denominator,

$$\begin{aligned} \Vert \alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j}\Vert _2^2&= \alpha _{i,j}^2 + 2 \alpha _{i,j} (\varvec{n}_{i,j})_i + \Vert \varvec{n}_{i,j}\Vert _2^2 \\&\le \alpha _{i,j}^2 + 2 \alpha _{i,j} \Vert \varvec{n}_{i,j}\Vert _2 + \Vert \varvec{n}_{i,j}\Vert _2^2 \\&= (\alpha _{i,j} + \Vert \varvec{n}_{i,j}\Vert _2)^2. \end{aligned}$$

This means \(\Vert \alpha _{i,j} \varvec{e}_{i} + \varvec{n}_{i,j}\Vert _2 \le \alpha _{i,j} + \Vert \varvec{n}_{i,j}\Vert _2\). Accordingly, the ith element of \(\bar{\varvec{z}}_{i,j}\) is bounded from below as follows:

$$\begin{aligned} (\bar{\varvec{z}}_{i,j})_i \ge \frac{\alpha _{i,j} - \Vert \varvec{n}_{i,j}\Vert _2}{\alpha _{i,j} + \Vert \varvec{n}_{i,j}\Vert _2}. \end{aligned}$$

For some nonnegative real number c, the function \(f(x) = \frac{x - c}{x + c}\) for positive real numbers x is monotonically nondecreasing. Consequently, since \(\alpha _{i,j} \ge \alpha _{\text{ min }}\), we obtain

$$\begin{aligned} (\bar{\varvec{z}}_{i,j})_i \ge \frac{\alpha _{i,j} - \Vert \varvec{n}_{i,j}\Vert _2}{\alpha _{i,j} + \Vert \varvec{n}_{i,j}\Vert _2} \ge \frac{\alpha _{\text{ min }} - \Vert \varvec{n}_{i,j}\Vert _2}{\alpha _{\text{ min }} + \Vert \varvec{n}_{i,j}\Vert _2}, \end{aligned}$$

which is the claimed bound.□

For some real number \(\xi\) satisfying \(0 \le \xi \le 1\), let

$$\begin{aligned} C(i, \xi ) := \{\varvec{x}\in {\mathbb{R}}^k : \Vert \varvec{x}\Vert _2 = 1, \ x_i \ge \xi \} \end{aligned}$$

where \(x_i\) is the ith element of \(\varvec{x}\). If \(\Vert \varvec{n}_{i,j} \Vert _2 \le \alpha _{\text{ min }}\), Proposition 2 tells us that

$$\begin{aligned} \bar{\varvec{z}}_{i,j} \in C \biggl ( i, \ \frac{\alpha _{\text{ min }} - \Vert \varvec{n}_{i,j}\Vert _2 }{\alpha _{\text{ min }} + \Vert \varvec{n}_{i,j}\Vert _2 } \biggl ). \end{aligned}$$

Obviously, the inner product of two elements from the same set \(C(i, \xi )\) is large, while that of two elements from different sets \(C(i, \xi )\) and \(C(j, \xi )\) with \(i \ne j\) is small. In the lemma below, we present bounds on those inner products.

Lemma 2

  (a) Let \(\varvec{a}, \varvec{b}\in C(i, \xi )\). Then, \(\varvec{a}^\top \varvec{b}\ge 2 \xi ^2 - 1\).

  (b) Let \(\varvec{a}\in C(i, \xi )\) and \(\varvec{b}\in C(j, \xi )\) for \(i \ne j\). Then, \(\varvec{a}^\top \varvec{b}\le - \xi ^2 + 2\sqrt{1 - \xi ^2} + 1\).

It is easy to prove this lemma. We thus put the proof in the Appendix. The proposition below immediately follows from the lemma.

Proposition 3

Assume that we have an element \(\varvec{a}_i\) in \(C(i, \xi )\) for every \(i = 1,\ldots ,k\). Pick an element \(\varvec{a}\) from \(C(j, \xi )\). If \(\xi > \frac{2\sqrt{2}}{3}\), then the chosen \(i^* = \arg \max _{i = 1,\ldots ,k} \varvec{a}_i^\top \varvec{a}\) satisfies \(i^* = j\).

Proof

Lemma 2 tells us that \(\varvec{a}_i^\top \varvec{a}\ge 2 \xi ^2 - 1\) if \(i = j\); otherwise, \(\varvec{a}_i^\top \varvec{a}\le - \xi ^2 + 2\sqrt{1 - \xi ^2} + 1\). Let us examine the range of \(\xi\) such that \(2\xi ^2 - 1\) is strictly larger than \(- \xi ^2 + 2\sqrt{1 - \xi ^2} + 1\). Let f be a function defined by \(f(x) = 3x^2 -2 \sqrt{1 - x^2} -2\) for \(0 \le x \le 1\). This function is monotonically increasing for \(0< x < 1\); \(f(0) = -4\) and \(f(1) = 1\); and \(f(x) = 0\) when \(x = \frac{2\sqrt{2}}{3}\). We thus see that f takes a positive value if \(x > \frac{2\sqrt{2}}{3}\). Accordingly, if \(\xi > \frac{2\sqrt{2}}{3}\),

$$\begin{aligned} \varvec{a}_j^\top \varvec{a}\ge 2 \xi ^2 - 1 > - \xi ^2 + 2\sqrt{1 - \xi ^2} + 1 \ge \varvec{a}_i^\top \varvec{a}\end{aligned}$$

holds for every \(i \in \{1, \ldots , k\} {\setminus } \{j\}\). Consequently, the choice of \(i^* = \arg \max _{i=1, \ldots , k} \varvec{a}_i^\top \varvec{a}\) satisfies \(i^* = j\).□
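The threshold \(\frac{2\sqrt{2}}{3} \approx 0.943\) can also be checked numerically. The snippet below, an illustration only, evaluates the two bounds of Lemma 2 at \(\xi = 0.95\) and confirms that the same-cone lower bound exceeds the cross-cone upper bound.

```python
import numpy as np

xi = 0.95                                        # any value above 2*sqrt(2)/3 ≈ 0.9428
same = 2 * xi**2 - 1                             # lower bound within one cone C(i, xi)
diff = -xi**2 + 2 * np.sqrt(1 - xi**2) + 1       # upper bound across different cones
print(same, diff, same > diff)                   # 0.805, 0.722..., True
```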

We are now ready to prove Theorem 4.

Proof of Theorem 4

Proposition 2 implies

$$\begin{aligned} \bar{\varvec{z}}_{i,j} \in C\biggl (i, \ \frac{\alpha _{\text{ min }} - \Vert \varvec{n}_{i,j}\Vert _2}{\alpha _{\text{ min }} + \Vert \varvec{n}_{i,j}\Vert _2} \biggr ) \end{aligned}$$

if \(\Vert \varvec{n}_{i,j}\Vert _2 \le \alpha _{\text{ min }}\). The theorem imposes the condition \(\Vert \varvec{r}_{i,j}\Vert _2 < (17-12\sqrt{2})\alpha _{\text{ min }}\); since \(17-12\sqrt{2} < 1\) and \(\Vert \varvec{n}_{i,j}\Vert _2 = \Vert \varvec{r}_{i,j}\Vert _2\), as noted below, the condition \(\Vert \varvec{n}_{i,j}\Vert _2 \le \alpha _{\text{ min }}\) is indeed met. Hence, \(\bar{\varvec{z}}_{i,j}\) belongs to \(C(i, \xi )\) for some \(\xi\) satisfying \(\frac{2\sqrt{2}}{3} < \xi \le 1\), since

$$\begin{aligned} \frac{\alpha _{\text{ min }} - \Vert \varvec{n}_{i,j}\Vert _2}{\alpha _{\text{ min }} + \Vert \varvec{n}_{i,j}\Vert _2} = \frac{\alpha _{\text{ min }} - \Vert \varvec{r}_{i,j}\Vert _2}{\alpha _{\text{ min }} + \Vert \varvec{r}_{i,j}\Vert _2} > \frac{2\sqrt{2}}{3}. \end{aligned}$$

The equality follows from \(\varvec{n}_{i,j} = \varvec{U}^\top \varvec{r}_{i,j}\) and the orthogonality of \(\varvec{U}\). We have \(\bar{\varvec{p}}_{u_i}^\top \bar{\varvec{p}}_v = \bar{\varvec{z}}_{u_i}^\top \bar{\varvec{z}}_v\). In addition, the theorem assumes that we have \(u_i \in S_i\) for \(i=1,\ldots ,k\) and picks \(v \in S_j\). Hence, \(\bar{\varvec{z}}_{u_i} \in C(i, \xi )\) for \(i = 1, \ldots , k\) and \(\bar{\varvec{z}}_{v} \in C(j, \xi )\). Accordingly, Proposition 3 ensures that the chosen \(i^* = \arg \max _{i=1,\ldots ,k} \bar{\varvec{p}}_{u_i}^\top \bar{\varvec{p}}_v = \arg \max _{i=1,\ldots ,k} \bar{\varvec{z}}_{u_i}^\top \bar{\varvec{z}}_v\) satisfies \(i^* = j\).□
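The assignment rule of Step 3 used in this proof amounts to one matrix product followed by an argmax. The following Python sketch illustrates it under the assumption that the normalized embedded points are stored as the columns of a \(k \times n\) array and that the representative node indices found in Step 2 are given; the function name is ours, and this is not the implementation used in Sect. 6.

```python
import numpy as np

def assign_to_representatives(P_bar, reps):
    # P_bar: k x n array whose column v is the normalized embedded point of node v.
    # reps:  list [u_1, ..., u_k] of representative node indices, one per cluster.
    scores = P_bar[:, reps].T @ P_bar        # scores[i, v] = <p_bar_{u_i}, p_bar_v>
    return np.argmax(scores, axis=0)         # index i* of the cluster that node v joins
```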

4.3 Proofs of Theorem 2 and Corollary 1

Theorem 2 is proved using Theorems 1, 3 and 4.

Proof of Theorem 2

Let a k-way partition \(\Gamma = \{S_1, \ldots , S_k\}\) of a graph G be optimal. We take the normalized indicators \(\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k\) of \(S_1, \ldots , S_k\). Let \(\varvec{Q}= [\bar{\varvec{g}}_1, \ldots , \bar{\varvec{g}}_k]^\top \in {\mathbb{R}}^{k \times n}\). Step 1 of ELLI computes the bottom k eigenvectors \(\varvec{f}_1, \ldots , \varvec{f}_k\) of the normalized Laplacian \(\mathcal {L}\) of G. Let \(\varvec{P}= [\varvec{f}_1, \ldots , \varvec{f}_k]^\top \in {\mathbb{R}}^{k \times n}\). By assumption, the graph G satisfies \(\Upsilon > 4k / (\theta \alpha )^2\). This implies \(\Upsilon \ge k\), since \(0 \le \theta \alpha \le 1\) holds from the definitions of \(\theta\) and \(\alpha\) and hence \(4k / (\theta \alpha )^2 \ge 4k\). Hence, Theorem 1 ensures that there is some orthogonal matrix \(\varvec{U}\in {\mathbb{R}}^{k \times k}\) such that \(\varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\) and

$$\begin{aligned} \Vert \varvec{R}\Vert _2 \le 2\sqrt{\frac{k}{\Upsilon }} = 2 \sqrt{\frac{k \cdot \phi _k(G)}{\lambda _{k+1}}}. \end{aligned}$$

Since G satisfies \(\Upsilon > 4k / (\theta \alpha )^2\), we have

$$\begin{aligned} \Upsilon = \frac{\lambda _{k+1}}{\phi _k(G)} > \frac{4k}{(\theta \alpha )^2} \Leftrightarrow \frac{4k \cdot \phi _k(G)}{\lambda _{k+1}} < (\theta \alpha )^2. \end{aligned}$$

This implies \(\Vert \varvec{R}\Vert _2 < \theta \alpha\). Recalling the range shown in (11) in Sect. 4.2, we see that \(\Vert \varvec{R}\Vert _2\) lies within both of the ranges imposed in Theorems 3 and 4. Theorem 3 ensures that the set I constructed in Step 2 is the representative node set of the optimal k-way partition \(\Gamma\). Let \(u_1, \ldots , u_k\) be the elements of I such that \(u_i \in S_i\) for \(i=1,\ldots ,k\). Theorem 4 ensures that, if \(v \in S_i\), then Step 3 adds v to the set \(T_i\). Consequently, the obtained set \(T_i\) coincides with the cluster \(S_i\) in the optimal k-way partition \(\Gamma\) for \(i=1,\ldots ,k\).□

Corollary 1 immediately follows from Theorem 2.

Proof of Corollary 1

The first and second conditions of Corollary 1 lead to

$$\begin{aligned} \theta _{\text{ max }} \le 1 - \frac{1}{\sqrt{k^p}} \quad \text{ and } \quad \theta _{\min } \ge \frac{1}{\sqrt{k^q}}. \end{aligned}$$

From the inequalities, we obtain

$$\begin{aligned} \frac{1}{2} (1 - \theta _{\text{ max }}) \ge \frac{1}{2\sqrt{k^p}} \ge \frac{1}{2\sqrt{k^p + k^q}}, \end{aligned}$$

and

$$\begin{aligned} (17 - 12 \sqrt{2}) \theta _{\text{ min }} \ge \frac{1}{50 \sqrt{k^q}} \ge \frac{1}{50 \sqrt{k^p + k^q}}. \end{aligned}$$

Hence, we can bound \(\theta\) as follows.

$$\begin{aligned} \theta = \min \left\{ \frac{1}{2}(1 - \theta _{\text{ max }}), (17 - 12\sqrt{2}) \theta _{\text{ min }} \right\} \ge \frac{1}{50 \sqrt{k^p + k^q}}. \end{aligned}$$

Also, using the third condition of Corollary 1, we can bound \(\alpha\) as follows.

$$\begin{aligned} \alpha \ge \frac{1}{\sqrt{k^r}}. \end{aligned}$$

The bounds on \(\theta\) and \(\alpha\) imply

$$\begin{aligned} \frac{4k}{(\theta \alpha )^2} \le 10^4 \cdot k^{r+1}(k^p + k^q) \le 2 \cdot 10^4 \cdot k^{(r+1) \cdot { \max \{p, q\}}}. \end{aligned}$$

Accordingly, if \(\Upsilon = \Omega \left( k^{(r+1) \cdot { \max \{p, q\}}} \right)\), Theorem 2 ensures that the output of ELLI is the optimal k-way partition of a graph.□

5 Connection with computing separable NMFs

Finding the representative node set of a graph partition is connected with computing separable NMFs. In what follows, we will use the symbol \({\mathbb{R}}_+^{d \times n}\) to denote the set of \(d \times n\) nonnegative matrices. Let a nonnegative matrix \(\varvec{A}\in {\mathbb{R}}_+^{d \times n}\) have a factorization \(\varvec{A}= \varvec{S}\varvec{C}\) for nonnegative matrices \(\varvec{S}\in {\mathbb{R}}_+^{d \times k}\) and \(\varvec{C}\in {\mathbb{R}}_+^{k \times n}\). This factorization is referred to as an NMF of \(\varvec{A}\). The NMF of \(\varvec{A}\) is called separable if it can be further factorized into

$$\begin{aligned} \varvec{A}= \varvec{S}\varvec{C}\ \text{ for } \ \varvec{S}\in {\mathbb{R}}_+^{d \times k} \quad \text{ and } \quad \varvec{C}= [\varvec{I}, \varvec{H}]\varvec{\Pi }\in {\mathbb{R}}_+^{k \times n}. \end{aligned}$$
(12)

Here, \(\varvec{I}\) is a \(k \times k\) identity matrix, \(\varvec{H}\) is a \(k \times (n-k)\) nonnegative matrix, and \(\varvec{\Pi }\) is an \(n \times n\) permutation matrix. A separable NMF is a special case of an NMF that satisfies the further condition that all columns of \(\varvec{S}\) appear in those of \(\varvec{A}\). Given \(\varvec{A}\) of the form shown in (12), a separable NMF problem is one of finding a column index set I with k elements such that \(\varvec{A}(I) = \varvec{S}\). Here, \(\varvec{A}(I)\) denotes the submatrix of \(\varvec{A}\) indexed by I, i.e., \(\varvec{A}(I) = [\varvec{a}_i : i \in I]\) for the ith column \(\varvec{a}_i\) of \(\varvec{A}\). We call such an index set I the basis index set. The notion of separability was introduced by Donoho and Stodden (2003). Arora et al. (2012) showed that separable NMF problems are solvable in polynomial time.

Let \(\varvec{A}\) be of the form shown in (12). The perturbed matrix is given by \(\tilde{\varvec{A}} = \varvec{A}+ \varvec{R}\) for \(\varvec{R}\in {\mathbb{R}}^{d \times n}\), which is the noise added to \(\varvec{A}\). Consider an algorithm for solving separable NMF problems. Given \(\tilde{\varvec{A}}\) and k, we say that the algorithm is robust to noise if it finds a column index set I with k elements such that \(\tilde{\varvec{A}}(I)\) is close to \(\varvec{S}\). Several algorithms have been shown to be robust to noise. SPA is an algorithm for solving separable NMF problems, and Gillis and Vavasis (2014) examined its robustness. They built the following setup for their analysis. A matrix \(\varvec{A}\in {\mathbb{R}}^{d \times n}\) is factorized into

$$\begin{aligned} \varvec{A}= \varvec{S}\varvec{C}\ \text{ for } \ \varvec{S}\in {\mathbb{R}}^{d \times k} \quad \text{ and } \quad \varvec{C}= [\varvec{I}, \varvec{H}]\varvec{\Pi }\in {\mathbb{R}}_+^{k \times n} \end{aligned}$$
(13)

satisfying the following two conditions.

  1. (A1)

    \(\varvec{S}\) is full column rank.

  2. (A2)

    Every column \(\varvec{h}_i\) of \(\varvec{H}\) satisfies \(\Vert \varvec{h}_i\Vert _1 \le 1\).

In their setup, \(\varvec{C}\) has to be nonnegative, but \(\varvec{S}\) does not necessarily have to be nonnegative, unlike \(\varvec{S}\) in (12). Let \(\varvec{A}\) be of the form shown in (13) and assume that it satisfies conditions (A1) and (A2). The perturbed matrix is given by \(\tilde{\varvec{A}} = \varvec{A}+ \varvec{R}\) for \(\varvec{R}\in {\mathbb{R}}^{d \times n}\). Given \(\tilde{\varvec{A}}\) and k as input, Gillis and Vavasis showed in Theorem 3 of Gillis and Vavasis (2014) that, if \(\Vert \varvec{R}\Vert _2\) is smaller than some threshold, then SPA finds a column index set I such that \(\tilde{\varvec{A}}(I)\) is close to \(\varvec{S}\). Gillis (2014) developed a successive nonnegative projection algorithm (SNPA) and showed in Theorem 3.22 of that paper that a similar robustness result holds for SNPA even if condition (A1) is replaced with a weaker condition. ELLI uses ER. The robustness of ER was shown in Theorem 9 of Mizutani (2014), which assumes that conditions (A1) and (A2) are satisfied and, in addition, that \(\varvec{S}\) is nonnegative.
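For concreteness, we include a minimal Python sketch of SPA. It follows the standard description of the algorithm, namely, repeatedly pick the column of the largest \(\ell _2\) norm and project the remaining columns onto the orthogonal complement of the picked one; it is an illustration, not the implementation analyzed in Gillis and Vavasis (2014).

```python
import numpy as np

def spa(A, k):
    """Return k column indices of A chosen by the successive projection algorithm."""
    R = np.array(A, dtype=float, copy=True)        # working residual matrix
    basis = []
    for _ in range(k):
        j = int(np.argmax(np.sum(R ** 2, axis=0))) # column with the largest l2 norm
        basis.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                    # project out the chosen direction
    return basis
```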

Let us go back to \(\varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}\) shown in (7). By choosing an \(n \times n\) permutation matrix \(\varvec{\Pi }\), we can rearrange the columns of \(\varvec{Q}\) such that

$$\begin{aligned} \varvec{Q}= \left[ \begin{array}{ccc | ccc | c | ccc} \alpha _{1,n_1} &{} &{} &{} \alpha _{1,1} &{} \cdots &{} \alpha _{1, n_1-1} &{} &{} &{} &{} \\ &{} \ddots &{} &{} &{} &{} &{} \ddots &{} &{} &{} \\ &{} &{} \alpha _{k,n_k} &{} &{} &{} &{} &{} \alpha _{k,1} &{} \cdots &{} \alpha _{k, n_k-1} \end{array} \right] \varvec{\Pi }\in {\mathbb{R}}^{k \times n} \end{aligned}$$

where \(\alpha _{i,j} = \sqrt{d_{i,j} / \mu (S_i)}\). Let \(\varvec{V}\) denote the \(k \times k\) submatrix consisting of the first k columns of the bracketed matrix, \(\varvec{V}= \text{ diag }(\alpha _{1,n_1}, \ldots , \alpha _{k,n_k})\). This matrix is nonsingular since all \(d_{i,j}\) are positive. Let \(\varvec{H}\) denote the product of \(\varvec{V}^{-1}\) and the remaining \(k \times (n-k)\) submatrix,

$$\begin{aligned} \varvec{H}= \left[ \begin{array}{ccc|ccc|c|ccc} \theta _{1,1} &{} \cdots &{} \theta _{1, n_1-1} &{} &{} &{} &{} &{} &{} &{} \\ &{} &{} &{} \theta _{2,1} &{} \cdots &{} \theta _{2, n_2-1} &{} &{} &{} &{} \\ &{} &{} &{} &{} &{} &{} \ddots &{} &{} &{} \\ &{} &{} &{} &{} &{} &{} &{} \theta _{k,1} &{} \cdots &{} \theta _{k, n_k-1} \\ \end{array} \right] \in {\mathbb{R}}^{k \times (n-k)} \end{aligned}$$

where \(\theta _{i,j} = \alpha _{i,j} / \alpha _{i,n_i} = \sqrt{d_{i,j} / d_{i,n_i}}\). Accordingly, \(\varvec{Q}\) can be rewritten as

$$\begin{aligned} \varvec{Q}= \varvec{V}\varvec{C}\ \text{ for } \ \varvec{V}\in {\mathbb{R}}_+^{k \times k} \ \text{ and } \ \varvec{C}= [\varvec{I},\varvec{H}]\varvec{\Pi }\in {\mathbb{R}}_+^{k \times n}. \end{aligned}$$

The above shows that \(\varvec{Q}= \varvec{V}\varvec{C}\) is a separable NMF and that the basis index set corresponds to the representative node set of the k-way partition of the graph. We can therefore see that finding the representative node set of a graph partition is equivalent to finding the basis index set of \(\varvec{Q}\). The grouping stage of spectral clustering needs to deal with \(\varvec{P}\) rather than \(\varvec{Q}\). This matrix is given as

$$\begin{aligned} \varvec{P}= \varvec{U}\varvec{Q}+ \varvec{R}= \varvec{U}\varvec{V}\varvec{C}+ \varvec{R}. \end{aligned}$$

The matrix \(\varvec{U}\varvec{V}\) is not necessarily nonnegative. Nevertheless, we can perform SPA and SNPA on \(\varvec{P}\) and ensure their robustness if \(\Vert \varvec{R}\Vert _2\) is small, because the matrix \(\varvec{U}\varvec{V}\varvec{C}\) satisfies conditions (A1) and (A2). Indeed, \(\varvec{U}\varvec{V}\) is \(k \times k\) and nonsingular because \(\varvec{U}\) and \(\varvec{V}\) both are. In addition, \(\varvec{C}= [\varvec{I}, \varvec{H}]\varvec{\Pi }\) is nonnegative and every column \(\varvec{h}_i\) of \(\varvec{H}\) satisfies \(\Vert \varvec{h}_i\Vert _1 \le \theta _{\text{ max }} \le 1\). Meanwhile, Theorem 9 of Mizutani (2014), which describes the robustness of ER, does not apply to \(\varvec{P}\), because it requires \(\varvec{U}\varvec{V}\) to be nonnegative.
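To illustrate this observation, the following sketch (which reuses the spa function from the snippet above; all sizes and the noise level are arbitrary) builds a small separable matrix \(\varvec{C}= [\varvec{I}, \varvec{H}]\varvec{\Pi }\) satisfying (A2), forms \(\varvec{P}= \varvec{U}\varvec{V}\varvec{C}\) plus a tiny perturbation with a random orthogonal \(\varvec{U}\) and a positive diagonal \(\varvec{V}\), and checks that SPA recovers the basis index set even though \(\varvec{U}\varvec{V}\) is not nonnegative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 4, 60
H = rng.random((k, n - k))
H /= H.sum(axis=0, keepdims=True) + 1.0           # columns of H have l1 norm < 1 (A2)
perm = rng.permutation(n)                         # plays the role of Pi
C = np.hstack([np.eye(k), H])[:, perm]
basis_true = sorted(np.where(perm < k)[0].tolist())

U, _ = np.linalg.qr(rng.standard_normal((k, k)))  # random orthogonal U
V = np.diag(rng.uniform(0.5, 1.5, size=k))        # nonsingular nonnegative diagonal V
P = U @ V @ C + 1e-6 * rng.standard_normal((k, n))

print(sorted(spa(P, k)) == basis_true)            # True: SPA finds the basis index set
```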

Remark 2

In Sect. 4.1, we mentioned the size of \(\Vert \varvec{R}\Vert _2\). That is, if \(\Upsilon \ge k\) holds, then we have \(\Vert \varvec{R}\Vert _2 \le 2\sqrt{k / \Upsilon }\) for \(\varvec{Q}\) corresponding to the optimal k-way partition of a graph. Hence, if the graph is well-clustered, \(\Vert \varvec{R}\Vert _2\) is small.

6 Experiments

We describe experiments conducted on synthetic and real data. The details of the implementations of ELLI and KSC are as follows.

  • (ELLI) Step 1 computes the bottom k eigenvectors of the normalized Laplacian. For this computation, we used the MATLAB command eigs, choosing the value ’sa’ (smallest algebraic) for the input argument that selects which eigenvalues to compute. Step 2 computes an origin-centered MVEE for the set of points. We used an interior-point method working within a cutting-plane framework. Our implementation followed Algorithm 3 of Mizutani (2014) and used the interior-point method in the SDPT3 software package (Toh et al., 1999).

  • (KSC) Our implementation followed the algorithm in von Luxburg (2007) that is referred to as normalized spectral clustering according to Shi and Malik. In the embedding stage, we used the MATLAB command eigs with the same settings as in ELLI for the eigenvector computation. We then constructed the spectral embedding map F of the form shown in (2) with a scaling factor \(s_u = 1 / \sqrt{d_u}\) for the degree \(d_u\) of node u. In the grouping stage, we adopted the k-means++ algorithm (Arthur & Vassilvitskii, 2007) for finding a k-way partition of the set X of points \(F(1), \ldots , F(n)\), since our preliminary experiments indicated that k-means++ outperformed Lloyd’s algorithm (Lloyd, 1982). To perform it, we used the MATLAB command kmeans with the input arguments ’Start’, ’plus’, ’EmptyAction’, ’singleton’, ’MaxIter’, 1000; the value of ’MaxIter’, the maximum number of iterations, was thus set to 1000. A rough Python sketch of this pipeline is given after this list.
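The following Python sketch outlines the KSC pipeline described in the list above: the bottom k eigenvectors of the normalized Laplacian, rows scaled by \(1 / \sqrt{d_u}\), and then k-means++. It is only a rough analogue of our MATLAB implementation; the function name ksc is ours, and the graph is assumed to have no isolated nodes.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def ksc(W, k, n_init=100):
    # W: sparse symmetric adjacency matrix (n x n) with positive degrees; k: #clusters.
    d = np.asarray(W.sum(axis=1)).ravel()
    D_isqrt = diags(1.0 / np.sqrt(d))
    L_sym = identity(W.shape[0]) - D_isqrt @ W @ D_isqrt   # normalized Laplacian
    _, F = eigsh(L_sym, k=k, which='SA')                   # bottom k eigenvectors
    X = D_isqrt @ F                                        # scale row u by 1/sqrt(d_u)
    return KMeans(n_clusters=k, init='k-means++', n_init=n_init).fit_predict(X)
```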

The experiments were conducted on an Intel Xeon Processor E5-1620 with 32 GB memory running MATLAB R2016a.

6.1 Synthetic data

The first set of experiments assessed how close the conductances of the k clusters found by ELLI and KSC were to the k-way conductance \(\phi _k(G)\) of the graph. As it is hard to compute \(\phi _k(G)\), we used synthetic data for which an upper bound on \(\phi _k(G)\) is easily obtainable. Specifically, we synthesized adjacency matrices and constructed the normalized Laplacians from them.

We will use the following notation to describe the generation procedure. For integers p and q with \(p \le q\), the notation [p : q] indicates the set of all integers from p to q. For instance, \([1:3] = \{1,2,3\}\). For a matrix \(\varvec{A}\in {\mathbb{R}}^{m \times n}\) with \(m \ge n\), we take a set \(K = [p:q]\) satisfying \(K \subset [1:n]\). The symbol \(\varvec{A}_K\) denotes the submatrix \([a_{ij} : i \in K, \ j \in K]\) of \(\varvec{A}\), where \(a_{ij}\) is the (ij)th element of \(\varvec{A}\),

$$\begin{aligned} \varvec{A}_K = \left[ \begin{array}{ccc} a_{pp} &{} \cdots &{} a_{pq} \\ \vdots &{} \ddots &{} \vdots \\ a_{qp} &{} \cdots &{} a_{qq} \\ \end{array} \right] . \end{aligned}$$

The following procedure was used to make the adjacency matrix.

  1. 1.

    Choose an \(n \times n\) symmetric matrix \(\varvec{M}\) such that the diagonal elements are all 0 and the other elements lie in the interval \(\{x : 0< x < 1\}\).

  2. 2.

    Choose k integers \(n_1, \ldots , n_k\) satisfying \(n = n_1 + \cdots + n_k\) and construct

    $$\begin{aligned} S_i = \Biggl [ \sum _{\ell =1}^{i-1} n_\ell + 1 : \sum _{\ell =1}^{i} n_\ell \Biggr ] \end{aligned}$$

    for \(i = 1, \ldots , k\).

  3. 3.

    Let \(\varvec{B}\) be an \(n \times n\) block diagonal matrix

    $$\begin{aligned} \left[ \begin{array}{ccc} \varvec{M}_{S_1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} \varvec{M}_{S_k} \\ \end{array} \right] \end{aligned}$$

    where \(\varvec{M}_{S_1}, \ldots , \varvec{M}_{S_k}\) are the submatrices of \(\varvec{M}\) indexed by \(S_1, \ldots , S_k\). Also, let \(\varvec{R}\) be an off-block diagonal matrix \(\frac{1}{2} (\varvec{M}- \varvec{B})\). Choose a value of the intensity parameter \(\delta\) from 0 to 2 and generate an \(n \times n\) symmetric matrix \(\varvec{W}= \varvec{B}+ \delta \varvec{R}\).

The generated matrix \(\varvec{W}\) is regarded as the adjacency matrix of some graph G, and the constructed sets \(S_1, \ldots , S_k\) are the clusters of a k-way partition of G. When \(\delta = 0\), the matrix \(\varvec{W}\) is block diagonal, and the corresponding graph G consists of k connected components. As \(\delta\) increases, the block structure gradually disappears. When \(\delta = 2\), the original matrix \(\varvec{M}\) is recovered. A simple calculation shows that the conductance \(\phi (S_i)\) of \(S_i\) is

$$\begin{aligned} \phi (S_i) = \frac{\delta }{c_i + \delta }. \end{aligned}$$

Here, \(c_i\) is a positive number determined by \(\varvec{M}\). Since \(\phi (S_i) = \delta / (c_i + \delta )\) is decreasing in \(c_i\), the maximum of \(\phi (S_1), \ldots , \phi (S_k)\) equals \(\delta / (c + \delta )\) with \(c = \min \{c_1, \ldots , c_k\}\). Hence, the k-way conductance \(\phi _k(G)\) of G is bounded from above by \(f(\delta )\) for \(f(x) = x / (c + x)\). This may serve as a good upper bound on \(\phi _k(G)\); in particular, it can be tight when \(\delta\) is sufficiently small.
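A Python sketch of the generation procedure is given below; it assumes that the matrix \(\varvec{M}\) and the cluster sizes \(n_1, \ldots , n_k\) have already been chosen, and the function name is ours.

```python
import numpy as np

def synth_adjacency(M, sizes, delta):
    # M: n x n symmetric matrix with zero diagonal and off-diagonal entries in (0, 1).
    # sizes: cluster sizes n_1, ..., n_k summing to n; delta: intensity in [0, 2].
    B = np.zeros_like(M)
    start = 0
    for n_i in sizes:                       # copy the diagonal blocks M_{S_i}
        B[start:start + n_i, start:start + n_i] = M[start:start + n_i, start:start + n_i]
        start += n_i
    R = 0.5 * (M - B)                       # off-block-diagonal part
    return B + delta * R                    # W = B + delta * R
```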

On the basis of the above procedure, the experiments generated two types of dataset: balanced and unbalanced. We set \(n = 10,000\) and constructed \(\Gamma _1\) and \(\Gamma _2\) as follows.

  • \(\Gamma _1 = \{S_1, \ldots , S_{50}\}\) with

    $$\begin{aligned} |S_1| = \cdots = |S_{50}| = 200. \end{aligned}$$
  • \(\Gamma _2 = \{S_1, \ldots , S_{143}\}\) with

    $$\begin{aligned} |S_1| = |S_2| = |S_3| = 1000 \quad \text{ and } \quad |S_4| = \cdots = |S_{143}| = 50. \end{aligned}$$

We constructed adjacency matrices with an intensity parameter \(\delta\) running from 0 to 2 in increments of 0.1 for each \(\Gamma _1\) and \(\Gamma _2\). The set of adjacency matrices for \(\Gamma _1\) was the balanced dataset, while that of \(\Gamma _2\) was the unbalanced dataset.

The experiments ran ELLI and KSC on the normalized Laplacians produced from the datasets. The quality of the obtained clusters was evaluated by the maximum value of cluster conductance (MCC), defined by

$$\begin{aligned} \max \{\phi (S_1), \ldots , \phi (S_k)\} \end{aligned}$$

for the output \(\{ S_1, \ldots , S_k \}\) of the algorithm. For KSC, the k-means++ algorithm was repeated 100 times for each input; KSC was therefore evaluated by the average MCC over the 100 trials.
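For reference, the conductance of a cluster and the MCC of a partition can be computed from an adjacency matrix as in the short sketch below; the helper functions are ours, and a dense matrix with clusters given as boolean masks is assumed.

```python
import numpy as np

def conductance(W, S):
    # W: symmetric adjacency matrix; S: boolean mask selecting the nodes of a cluster.
    cut = W[S][:, ~S].sum()      # total weight of edges between S and its complement
    vol = W[S].sum()             # volume of S: sum of the degrees of its nodes
    return cut / vol

def mcc(W, clusters):
    # clusters: list of boolean masks forming a k-way partition of the node set.
    return max(conductance(W, S) for S in clusters)
```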

Figure 2 shows the experimental results. The top two figures are the results of ELLI and KSC on the balanced dataset, and the bottom two figures are those on the unbalanced dataset. The red points in the left figures are the MCCs of ELLI, while those in the right figures are the average MCCs of KSC. The black dotted line depicts the function \(f(x) = x / (c + x)\) that serves as an upper bound on \(\phi _k(G)\). We can see from the figures that the MCC of ELLI approaches the upper bound and tends to be lower than the average MCC of KSC on both datasets. Figure 3 clarifies the differences between them for the balanced and unbalanced datasets. Each red point plots the average MCC of KSC minus the MCC of ELLI. We clearly see from the figures that the MCC of ELLI is consistently below the average MCC of KSC except at \(\delta = 0\).

Fig. 2

Results of ELLI and KSC for balanced and unbalanced datasets. The red points in the left figures are the MCCs of ELLI for each adjacency matrix with \(\delta\), while those in the right figures are the average MCCs over the 100 trials of the k-means++ algorithm for each adjacency matrix with \(\delta\). The black dotted line is \(f(x) = x / (c+x)\), an upper bound on graph conductance (Color figure online)

Fig. 3

Difference between the MCC of ELLI and the average MCC of KSC: balanced dataset (left) and unbalanced dataset (right). Each red point is the value obtained by subtracting the MCC of ELLI from the average MCC of KSC. Thus, a red point lying on the positive side indicates that the MCC of ELLI is below the average MCC of KSC (Color figure online)

6.2 Real data

The second set of experiments assessed how effective ELLI is at clustering real data. We chose several image databases containing images that had been categorized into classes by human judges. We then constructed image datasets using all or parts of these databases and evaluated how well the clusters found by ELLI matched the classes of the datasets.

For comparison, we tested KSC and the graph regularized NMF (GNMF) proposed in Cai et al. (2011). GNMF is known to be effective at clustering. Before describing the details of the experiments, let us briefly review the clustering algorithm based on GNMF. Let n nonnegative data vectors \(\varvec{a}_1, \ldots , \varvec{a}_n \in {\mathbb{R}}^d\) be given. The clustering algorithm maps the n data vectors to n points in a lower-dimensional space and then applies the k-means method to the points. The mapping is constructed using GNMF. For a nonnegative matrix \(\varvec{A}\in {\mathbb{R}}^{d \times n}_+\) that stacks \(\varvec{a}_1, \ldots , \varvec{a}_n\) in its columns, the GNMF problem is one of finding two nonnegative matrices \(\varvec{X}\in {\mathbb{R}}^{d \times k}_+\) and \(\varvec{Y}\in {\mathbb{R}}^{k \times n}_+\) that minimize the cost function

$$\begin{aligned} f(\varvec{X},\varvec{Y}) = \Vert \varvec{A}- \varvec{X}\varvec{Y}\Vert _F^2 + \lambda \cdot \text{ tr }(\varvec{Y}\varvec{L}\varvec{Y}^\top ). \end{aligned}$$

Here, \(\lambda\) is a positive parameter specified by the user, and \(\varvec{L}\) is the Laplacian of the adjacency matrix \(\varvec{W}\) formed from the data vectors. The symbol \(\text{ tr }(\cdot )\) denotes the trace of a matrix. This is an NMF problem with a regularization term \(\text{ tr }(\varvec{Y}\varvec{L}\varvec{Y}^\top )\). After solving it heuristically, the clustering algorithm regards the columns \(\varvec{y}_1, \ldots , \varvec{y}_n\) of \(\varvec{Y}\) as the representations of the data vectors \(\varvec{a}_1, \ldots , \varvec{a}_n\) in \({\mathbb{R}}^k\) and applies the k-means method to \(\varvec{y}_1, \ldots , \varvec{y}_n\). If two data vectors \(\varvec{a}_i\) and \(\varvec{a}_j\) are close, the corresponding columns \(\varvec{y}_i\) and \(\varvec{y}_j\) should be close as well. The regularization term serves to make \(\varvec{y}_i\) and \(\varvec{y}_j\) even closer. Indeed, we can rewrite the term as

$$\begin{aligned} \text{ tr }(\varvec{Y}\varvec{L}\varvec{Y}^\top ) = \frac{1}{2} \sum _{i=1}^{n} \sum _{j=1}^{n} w_{ij} \Vert \varvec{y}_i - \varvec{y}_j \Vert _2^2 \end{aligned}$$

for the (i, j)th element \(w_{ij}\) of the adjacency matrix \(\varvec{W}\). If \(\varvec{a}_i\) and \(\varvec{a}_j\) are close together, then \(w_{ij}\) takes a high value. Hence, we expect that the columns \(\varvec{y}_i\) and \(\varvec{y}_j\) of \(\varvec{Y}\) found by solving the GNMF problem are also close together. The code for GNMF is available from the website of the first author of Cai et al. (2011). In the experiments, we used it for computing \(\varvec{Y}\). For clustering the columns of \(\varvec{Y}\), we used the MATLAB command kmeans with the same settings as in KSC.
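The identity above is easy to verify numerically; the snippet below, an illustration only, checks it for randomly generated \(\varvec{W}\) and \(\varvec{Y}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 5
W = rng.random((n, n)); W = 0.5 * (W + W.T); np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian of W
Y = rng.random((k, n))

lhs = np.trace(Y @ L @ Y.T)
rhs = 0.5 * sum(W[i, j] * np.linalg.norm(Y[:, i] - Y[:, j]) ** 2
                for i in range(n) for j in range(n))
print(np.isclose(lhs, rhs))                   # True
```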

The experiment used five image databases: EMNIST (Cohen et al., 2017), ETL, Fashion-MNIST (Xiao et al., 2017), MNIST (Lecun et al., 1998), and NDL. The MNIST database is a standard benchmark for evaluating clustering algorithms. It contains the images of ten handwritten digits from 0 to 9. We used all images in it. MNIST is derived from the NIST Special Database. The EMNIST database is an extension of MNIST that consists of six datasets. Among them, the EMNIST-Balanced dataset contains the images of handwritten alphabet letters and digits. We used this dataset. The Fashion-MNIST database contains images of fashion products from ten categories, such as T-shirts, trousers and pullovers. We used all images in it. ETL and NDL are image collections of Japanese characters. The ETL dataset consists of all images of katakana characters in the ETL1 dataset of the ETL Character Database, an image collection of handwritten and machine-printed letters and digits collected by AIST, Japan. The NDL dataset consists of images of hiragana characters from the image databases at the website of the National Diet Library of Japan. The character images were extracted from documentary materials published from 1900 to 1977, which are available in the National Diet Library Digital Collections.

Except for NDL, all of the images in the datasets were grayscale. Some of the NDL images were RGB; we transformed them into grayscale images by using the MATLAB command rgb2gray. The sizes of the images in each dataset were equal. The n grayscale images in a dataset were represented as vectors \(\varvec{a}_1, \ldots , \varvec{a}_n\). Here, given an image size of \(h \times w\) pixels, an image vector \(\varvec{a}_i\) is \((h \times w)\)-dimensional and the value of each element is a grayscale intensity at the corresponding pixel. Table 1 summarizes the dimension d of the image vectors, the number n of images, and the number k of classes in the dataset.

Table 1 Image datasets used in the experiments

Adjacency matrices were formed from the image vectors. The construction was based on the procedure suggested in Section 2.2 of von Luxburg (2007). Let \(\varvec{a}_1, \ldots , \varvec{a}_n \in {\mathbb{R}}^d\) be image vectors in a dataset, and assume that \(\varvec{a}_1, \ldots , \varvec{a}_n\) are nonnegative. This assumption is a natural one, as the value of each element represents a grayscale intensity. The similarity between \(\varvec{a}_i\) and \(\varvec{a}_j\) is evaluated using

$$\begin{aligned} s(\varvec{a}_i, \varvec{a}_j) = \frac{\varvec{a}_i^\top \varvec{a}_j}{\Vert \varvec{a}_i\Vert _2 \Vert \varvec{a}_j\Vert _2}. \end{aligned}$$

The value ranges from 0 to 1. It is close to 1 if \(\varvec{a}_i\) is nearly parallel to \(\varvec{a}_j\), while it is close to 0 if \(\varvec{a}_i\) and \(\varvec{a}_j\) point in substantially different directions. The EMNIST-Balanced dataset contains over one hundred thousand images. If we computed the similarity values for all pairs of image vectors and constructed an adjacency matrix from all of them, the matrix would take up a large amount of memory. Hence, we replaced relatively small similarity values with zero. Specifically, we chose the p image vectors with the highest similarity to \(\varvec{a}_i\) and collected them in the set \(N_p(\varvec{a}_i)\). We then constructed an \(n \times n\) symmetric matrix \(\varvec{W}\) such that the (i, j)th element \(w_{ij}\) is

$$\begin{aligned} w_{ij} = \left\{ \begin{array}{ll} s(\varvec{a}_i, \varvec{a}_j) &{} \quad \text{ if } \ \varvec{a}_i \in N_p(\varvec{a}_j) \ \text{ or } \ \varvec{a}_j \in N_p(\varvec{a}_i), \\ 0 &{} \quad \text{ otherwise }. \end{array}\right. \end{aligned}$$

In the subsequent discussion, we will call p the neighbor size and \(N_p(\varvec{a}_i)\) the p-nearest neighbor set of \(\varvec{a}_i\).
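A Python sketch of this construction is shown below. It materializes all pairwise similarities and is therefore meant only as an illustration of the rule defining \(w_{ij}\) for moderate n, not as a memory-efficient implementation; the function name is ours.

```python
import numpy as np

def knn_adjacency(A, p):
    # A: d x n matrix whose columns are the nonnegative image vectors; p: neighbor size.
    X = A / np.linalg.norm(A, axis=0, keepdims=True)   # normalize each column
    S = X.T @ X                                        # S[i, j] = s(a_i, a_j)
    np.fill_diagonal(S, -np.inf)                       # exclude self-similarity
    n = S.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(S[i])[-p:]                   # indices of N_p(a_i)
        W[i, nbrs] = S[i, nbrs]
    return np.maximum(W, W.T)   # keep s(a_i, a_j) if a_i in N_p(a_j) or a_j in N_p(a_i)
```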

The experiments used two measures, AC and NMI, to evaluate how closely the clusters found by the algorithms matched the classes of each dataset. Here, recall that AC stands for accuracy and NMI for normalized mutual information. We are given n images indexed by integers \(1, \ldots , n\) in a dataset that have been manually classified into k classes \(C_1, \ldots , C_k \subset \{1, \ldots , n\}\). For clusters \(T_1, \ldots , T_k\) returned by an algorithm, we take a permutation \(\sigma : \{1,\ldots ,k\} \rightarrow \{1,\ldots ,k\}\) that maximizes \(\sum _{i=1}^{k} |C_i \cap T_{\sigma (i)} |\). AC is defined by

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^{k} |C_i \cap T_{\sigma (i)} | \end{aligned}$$

for this \(\sigma\). Note that the problem of finding such a permutation \(\sigma\) is an assignment problem and can be solved efficiently. Let \(\Gamma _1 = \{C_1, \ldots , C_k\}\) and \(\Gamma _2 = \{T_1, \ldots , T_k\}\). NMI is defined by

$$\begin{aligned} \frac{2 \cdot I(\Gamma _1; \Gamma _2)}{H(\Gamma _1) + H(\Gamma _2)}. \end{aligned}$$

Here, \(I(\Gamma _1; \Gamma _2)\) is the mutual information of \(\Gamma _1\) and \(\Gamma _2\), and \(H(\Gamma _1)\) and \(H(\Gamma _2)\) are the entropies of \(\Gamma _1\) and \(\Gamma _2\). For details, we refer readers to Section 16.3 of the textbook (Manning et al., 2008). The values of AC and NMI range from 0 to 1. A higher value indicates a higher degree of matching between clusters and classes. In particular, if there is a permutation \(\sigma\) such that \(C_i = T_{\sigma (i)}\) for every \(i=1, \ldots , k\), then AC and NMI are each 1. Besides AC and NMI, we measured the MCC and the elapsed time of the algorithm.
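The two measures can be computed as in the sketch below, assuming that the class and cluster assignments are encoded as integer label arrays taking values in \(\{0, \ldots , k-1\}\). The assignment problem for AC is solved with the Hungarian method, and NMI with the arithmetic-mean normalization \(2I / (H_1 + H_2)\) is available in scikit-learn; the function names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def accuracy(true_labels, pred_labels, k):
    # counts[i, j] = |C_i ∩ T_j|
    counts = np.zeros((k, k), dtype=int)
    for c, t in zip(true_labels, pred_labels):
        counts[c, t] += 1
    # Permutation sigma maximizing the total overlap (an assignment problem).
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(true_labels)

def nmi(true_labels, pred_labels):
    return normalized_mutual_info_score(true_labels, pred_labels)  # 2I / (H1 + H2)
```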

For each dataset, we constructed adjacency matrices using p-nearest neighbor sets by changing the neighbor size \(p \in \{10, 100, 200, 300, 400, 500\}\). In KSC and GNMF, we repeated the k-means++ algorithm 100 times for each input.

Table 2 AC, NMI and MCC of algorithms for each dataset in case of \(p=300\): AC (top), NMI (middle) and MCC (bottom)
Table 3 Elapsed time of algorithms in seconds for each dataset in case of \(p=300\). The columns labeled “KSC” and “GNMF” list the averages over the 100 trials of the k-means++ algorithm

Tables 2 and 3 show the experimental results for \(p = 300\): AC, NMI and MCC from top to bottom in Table 2 and elapsed time in Table 3. As noted above, the k-means++ algorithm was repeated 100 times for each input in KSC and GNMF. Table 2 lists the average, worst, and best values of the ACs, NMIs, and MCCs in the columns labeled “Average”, “Worst”, and “Best”. Table 3 lists the average elapsed times in the columns labeled “KSC” and “GNMF”. By comparing the measurements of ELLI with the averages of the measurements of KSC and GNMF, we can make the following observations.

  • ELLI and KSC outperform GNMF in terms of AC and NMI, except in the case of KSC on Fashion-MNIST. The AC and NMI of ELLI are higher than the average AC and NMI of KSC. However, the differences are small, except for those on NDL.

  • ELLI and KSC outperform GNMF in terms of MCC. The MCC of ELLI is lower than the average MCC of KSC, except for that on ETL.

  • ELLI and KSC are faster than GNMF. The elapsed time of ELLI is slightly longer than the average elapsed time of KSC.

Hence, the experimental results imply that the AC and NMI of ELLI can reach at least the average AC and NMI of KSC. This is an advantage of ELLI over KSC: after running the k-means method multiple times in KSC, one must appropriately select one of the outputs, which may not be an easy task. In fact, Table 2 shows that there is a gap between the worst and best values of the ACs and NMIs of KSC. The experimental results on synthetic and real data also imply that ELLI will often outperform KSC in terms of MCC.

The experimental results for the cases other than \(p=300\) show a similar tendency. Figure 4 plots the ACs, NMIs, and MCCs of the algorithms run on each dataset for neighbor sizes \(p \in \{10, 100, 200, 300, 400, 500\}\). We see that, even as p varies over this range, ELLI outperforms KSC in terms of AC and NMI on NDL and is about equal to KSC in terms of AC and NMI on the other datasets. Moreover, the MCC of ELLI is lower than the average MCC of KSC, except for that on ETL.

Remark 3

The previous version of this paper posted on arXiv used an incorrect value of \(k = 46\) for EMNIST-Balanced in the experiments. The current paper reports the experimental results for the dataset obtained with the correct value of \(k = 47\).

Fig. 4

AC, NMI and MCC of algorithms run on each dataset with varying neighbor sizes \(p \in \{10, 100, 200, 300, 400, 500\}\): a EMNIST-Balanced, b ETL, c Fashion-MNIST, d MNIST, and e NDL. From left to right, figures display AC, NMI and MCC. The red points indicate the measurements of ELLI, the blue triangles indicate the average of the measurements of KSC over the 100 trials of k-means++, and the green squares indicate the average of the measurements of GNMF over the 100 trials of k-means++ (Color figure online)

7 Discussion and future research

There remain issues that need to be addressed. In Theorem 2, we showed a range of \(\Upsilon\) that ensures the output of ELLI coincides with an optimal k-way partition of a graph. It is unclear whether this range can be further improved. In the experiments on the image datasets, we observed that ELLI did not always achieve significantly higher AC and NMI than KSC even when the conductance of the clusters returned by ELLI was lower than that of KSC. The main cause of this unfavorable situation could be that clusters with low conductance in a graph do not sufficiently capture the characteristics of the manually assigned dataset classes. The situation may be ameliorated by revising the way adjacency matrices are constructed from image vectors. We close this paper by suggesting directions for future research.

  • As explained in Sect. 5, it should be possible to replace the use of an ellipsoid in Step 2 of ELLI with an algorithm for solving separable NMF problems. In particular, SPA and SNPA are fit for this purpose. It would be interesting to explore whether a result similar to Theorem 2 can be obtained when either of them is used instead of an ellipsoid.

  • Spectral clustering needs to construct adjacency matrices from datasets. Constructing them for large-scale datasets is time and memory consuming. To address this issue, the authors of Chen and Cai (2011) and Huang et al. (2020) proposed the use of a bipartite graph for representing the similarities between data. It would be interesting to investigate the performance of ELLI combined with this technique.

  • We believe that ELLI works on hyperspectral unmixing problems. The problem is to find the spectra of the constituent materials, called endmembers, in a hyperspectral image. It can be cast as a graph-based clustering problem. Our preliminary experiments indicated that the index set I found in Step 2 of ELLI often provides a good estimate of the endmembers.