Introduction

Spectral clustering has become one of the most popular research fields in the last decades, and it has shown impressive results in practical applications [19, 20, 28]. The main idea of spectral clustering is to cluster the data into different groups by using the spectrum of a similarity matrix that captures the structure of the data [33]. Hence, the first step is to construct a similarity matrix, which plays an important role in the clustering performance. Therefore, in this paper, we introduce a multiobjective evolutionary algorithm into spectral clustering to construct the similarity matrix, and propose a Pareto ensemble-based spectral clustering framework (PESC).

Conventional similarity matrix construction

In recent years, many methods, including traditional methods [44] and other methods [2, 20, 29, 48], have been proposed to build an appropriate similarity matrix. The similarity matrix aims to model the relationships among the samples of a given dataset. Consider a dataset \(\varvec{A}=\{\varvec{a}_1,..., \varvec{a}_i,..., \varvec{a}_N\}\) with N samples, and let \(\varvec{S}\in {\mathbb {R}}^{N\times N}\) be the similarity matrix, where \(s_{ij}\) measures the similarity between samples \(\varvec{a}_i\) and \(\varvec{a}_j\). Taking the neighborhood of a sample \(\varvec{a}_i\) in \(\varvec{S}\) into consideration, the traditional methods can be summarized as follows [20]:

  1.

    Fully connected similarity matrix [12, 44]. This construction method treats every sample as a neighbor of all the other samples, and a nonzero similarity value is assigned to each \(s_{ij}\), resulting in a full (dense) matrix.

  2.

    k-nearest neighbor (kNN) similarity matrix [4, 21]. For a sample \(\varvec{a}_i\), all the similarity entries \(s_{ij}\quad (j=1,...,N)\) are set to 0 except those of its k nearest neighbors. However, the similarity matrix constructed in this way is not symmetric, and there are two strategies to resolve this. The first one is to set \(s_{ij}=s_{ji}\ne 0\) when sample \(\varvec{a}_i\) belongs to sample \(\varvec{a}_j\)’s k-nearest neighbors or sample \(\varvec{a}_j\) belongs to sample \(\varvec{a}_i\)’s k-nearest neighbors. The other is the mutual k-nearest neighbor construction method, which sets \(s_{ij}=s_{ji}\ne 0\) only when samples \(\varvec{a}_i\) and \(\varvec{a}_j\) both belong to each other’s k-nearest neighbors.

  3.

    \(\epsilon \)-neighborhood similarity matrix [12, 44]. If the distance between samples \(\varvec{a}_i\) and \(\varvec{a}_j\) is less than a threshold \(\epsilon \), then \(s_{ij}=1\); otherwise \(s_{ij}=0\).

In these methods, the Gaussian kernel function \(K(\varvec{a}_i, \varvec{a}_j)=\mathrm{exp}(\frac{-\mathrm{dist}_{ij}^2}{2\sigma ^2})\) is a typical measurement used to calculate the value of a nonzero similarity entry \(s_{ij}\). In this kernel function, \(\mathrm{dist}_{ij}\) represents the distance between samples \(\varvec{a}_i\) and \(\varvec{a}_j\) (usually the Euclidean distance is adopted), and \(\sigma \) is the scale factor that controls the width of the neighborhood, which plays an important part in inducing a meaningful neighborhood structure. Both the k-nearest neighbor and the \(\epsilon \)-neighborhood similarity matrices are sparse, so they place strict demands on the values of the parameters k and \(\epsilon \), which are difficult to determine. In recent years, some approaches have been proposed that learn a similarity graph with exactly K (the number of clusters) connected components for graph-based clustering, such as [30, 31]. To obtain this K-connected similarity graph, [31] proposed constructing an initial graph with k neighbors and calculating the values of the nonzero entries with a parameter-free approach; however, the value of k still needs to be set in that work. In both [30, 31], a common issue is that they must solve an optimization problem of the form \(\min _{\varvec{x}} f(\varvec{x})+\gamma g(\varvec{x})\), where \(\gamma \) is a parameter that is sensitive to the dataset.
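
To make the conventional constructions above concrete, the following minimal sketch builds a kNN similarity matrix with Gaussian-kernel weights, using the first symmetrization strategy. It is an illustration only; the function name and the dense distance computation are our own assumptions, not part of any cited method.

```python
import numpy as np

def knn_gaussian_similarity(A, k, sigma):
    """Illustrative kNN similarity matrix with Gaussian-kernel weights."""
    N = A.shape[0]
    # Pairwise Euclidean distances dist_ij
    dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    S = np.zeros((N, N))
    for i in range(N):
        # Keep the k nearest neighbors of sample a_i (excluding a_i itself)
        neighbors = [j for j in np.argsort(dist[i]) if j != i][:k]
        S[i, neighbors] = np.exp(-dist[i, neighbors] ** 2 / (2 * sigma ** 2))
    # First symmetrization strategy: s_ij = s_ji != 0 if either is a kNN relation
    return np.maximum(S, S.T)
```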

The multiobjective evolutionary algorithm based clustering

In recent years, multiobjective evolutionary algorithms (MOEAs) have attracted a lot of attention from researchers for their wide range of real-world applications [5, 7]. Many state-of-the-art MOEAs, such as NSGA-II [8], SPEA2 [53], IMOEA [41], and MOEA/D [15, 51], have been proposed to handle multiobjective optimization problems (MOPs). They have also been used successfully in the field of data mining, for example in clustering [10, 18], classification [11, 25], and feature selection [27, 45]. MOEA-based clustering algorithms focus on using multiple criteria, such as cluster validity indexes, to capture the characteristics of the data [16, 26]. They differ in several aspects, including the type of MOEA, the encoding scheme, the objective functions, the final Pareto optimal solution selection strategy, and even the evolutionary operators. A detailed survey on this issue is given in the literature [26]. To better understand the proposed framework PESC, we present some related work on the encoding schemes, the objective functions, and the final Pareto optimal solution selection strategies.

Considering the adopted encoding schemes, an experimental evaluation of cluster representations (prototype-based, label-based, and graph-based) for multiobjective evolutionary clustering has been conducted in [9]. The advantage of the prototype-based representation is that the encoding length is small, so it takes less time to apply the evolutionary operators; however, this representation is mainly suited to round-shaped clusters. On the contrary, the label-based and graph-based representations impose no restriction on the underlying cluster structure, but their encoding length is large, especially for large-scale datasets.

Cluster validity indexes are usually adopted as the objective functions in MOEAs. Indexes such as overall cluster deviation [14], cluster connectedness [14], Jm [47], XB [50], the silhouette index [37], the intracluster entropy H [35, 36], and cluster separation Sep(C) [35, 36] are the most commonly used objective functions. It is well known that the quality of the clustering strongly depends on the adopted objective functions, but to date there is no systematic analysis of the effect of different objective functions.

The final Pareto optimal solution selection strategies can be classified into three categories: the independent objective-based approach, the knee-based approach, and the ensemble-based approach. In the first approach, another cluster validity index, different from the objective functions, is used to measure the quality of the nondominated solutions, and the one with the best quality on this index is selected as the final solution; such approaches can be found in [22, 23]. The knee-based approach aims to select, from the Pareto front (PF), the solution for which a change in one objective value induces the maximum change in the other one, as in MOCK [13, 14] and StEMO [17].

In this paper, we focus on the ensemble-based final Pareto optimal solution selection strategy. Ensemble learning has proved to be effective in solving machine learning problems, especially in practical applications such as SAR image segmentation [52]. As for ensemble-based clustering, there is no explicit correspondence between the labels delivered by different clusterings, and the difficulty lies in finding a consensus partition from multiple algorithms or partitions [42]. In [40], three cluster ensemble algorithms, namely the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA), were introduced to complete the ensemble task. Since an MOEA can generate a set of nondominated solutions, it provides a suitable basis for ensemble learning. The ensemble-based approach assumes that all the nondominated solutions contain some useful information for detecting the structure of the dataset; therefore, all these solutions should be integrated to obtain a single clustering output. The three cluster-ensemble approaches CSPA, HGPA, and MCLA can easily be applied to MOEA clustering problems. A few researchers have also proposed MOEA-based cluster-ensemble algorithms. In [24], a multiobjective genetic algorithm-based approach for fuzzy clustering of categorical data, which simultaneously optimizes the fuzzy compactness and the fuzzy separation of the clusters (\(MOGA(\pi ,sep)\)), is proposed; to obtain the final clustering result, a majority voting strategy is applied to the nondominated solutions to select the training samples. In practical applications, the combination of the knee-based and ensemble-based approaches for recurrent neural networks has been used successfully in the prediction of computational fluid dynamic simulations [39] and in image identification [1]. In these studies, not all the individuals in the Pareto set are considered suitable solutions; only the Pareto-optimal solutions around the knee point are employed for the ensemble task. Besides that, MOEA has also been applied successfully to neural network ensembles in [3] by simply combining all the classifiers obtained from the MOEA to form the ensemble. In general, all these works have shown that ensemble-based methods perform better than simply selecting one Pareto solution.

Motivation and contributions

As far as we know, the existing MOEA-based clustering algorithms are barely related to spectral clustering, except for CSPA. In our previous work [20], we proposed a sparse representation based spectral clustering framework via MOEAs (denoted as SRMOSC); that work introduces multiobjective optimization into spectral clustering and constructs the similarity matrix using a sparse representation approach by modeling spectral clustering as a constrained multiobjective optimization problem. Unfortunately, as mentioned in that work, its space complexity is high, especially when solving large-scale problems. In addition, as discussed in the sections above, the conventional similarity matrix construction methods usually suffer from the difficulty of parameter tuning. Motivated by these two aspects, we tackle both problems in this paper.

It is generally known that an MOEA can generate a set of nondominated solutions for a multiobjective optimization problem (MOP). Hence, if the similarity matrix construction problem can be regarded as an MOP and all the nondominated solutions can participate in the construction of the similarity matrix, then this problem can be solved in a parallel way, which is more time-efficient than a sequential approach. Inspired by this idea, we propose a Pareto ensemble based spectral clustering framework whose main procedure falls into two phases. The main contributions are summarized as follows:

  1.

    A “divide and conquer” strategy is proposed to solve the similarity matrix construction problem. In PESC, the main procedure is divided into two phases: phase I detects the nonzero entries of the similarity matrix using an MOEA, and phase II determines the values of these nonzero entries via ensemble strategies.

  2.

    In phase I, we introduce dynamic multiobjective optimization into spectral clustering by adopting a similarity measurement and a specifically designed diversity measurement as objective functions. In addition, a specific initialization scheme is designed to speed up convergence in this phase.

  3.

    In phase II, three ensemble approaches are proposed to construct the similarity matrix and determine the values of its nonzero entries.

In contrast to the existing MOEA based spectral clustering algorithm, PESC strikes a balance between the time cost and the clustering accuracy with the “divide and conquer” strategy. Compared with the conventional similarity matrix construction methods, PESC can automatically determine the neighborhood structure of the similarity matrix. It should be noted that, unlike other Pareto-ensemble algorithms, in this paper a single estimated Pareto-optimal solution cannot construct a similarity matrix by itself. Moreover, using the Pareto-ensemble based framework to construct the similarity matrix reduces both the time and the space complexity.

To give a clear introduction and analysis of PESC, the paper is structured as follows. The description of PESC, including its two main phases, is given in “The description of PESC”. The experimental results and analysis are then presented in “Experiments and discussion”. Finally, concluding remarks on PESC are given in “Conclusion”.

The description of PESC

In spectral clustering, the key point is to construct a similarity matrix. Assume \(\varvec{A}=\{\varvec{a}_1, \varvec{a}_2,...,\varvec{a}_N\} \) is the given dataset with N samples and K clusters, and \(\varvec{S} = \begin{pmatrix} s_{11} & \cdots & s_{1N}\\ \vdots & \ddots & \vdots \\ s_{N1} & \cdots & s_{NN} \end{pmatrix}\) is the similarity matrix to be constructed; then the main procedure of spectral clustering is given in Algorithm 1.

[Algorithm 1: The main procedure of spectral clustering]
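
Since Algorithm 1 follows the standard spectral clustering pipeline, a minimal sketch of it is given below. The use of the unnormalized Laplacian \(\varvec{L}=\varvec{D}-\varvec{S}\) and of k-means on the eigenvectors follows the description later in this paper (Eq. (7) and Step 4); the concrete library calls are illustrative assumptions rather than a transcription of Algorithm 1.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(S, K):
    """Minimal sketch of the spectral clustering pipeline (cf. Algorithm 1)."""
    S = np.maximum(S, S.T)              # enforce symmetry, as in Eq. (7)
    D = np.diag(S.sum(axis=1))          # degree matrix
    L = D - S                           # unnormalized graph Laplacian
    # Eigenvectors associated with the K smallest eigenvalues of L
    _, U = eigh(L, subset_by_index=[0, K - 1])
    # Cluster the rows of the spectral embedding with k-means
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```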

PESC aims to construct the similarity matrix \(\varvec{S}\) with a Pareto ensemble approach, and we give a detailed description of it in this section. The main procedure (see Algorithm 2) is divided into two phases: phase I, the nonzero entries determination phase (Steps 1–2), and phase II, the Pareto ensemble based weight determination phase (Step 3). Hence, this section is divided into two subsections, one for each phase.

[Algorithm 2: The main procedure of PESC]

Phase I: Nonzero entries determination

Phase I consists of the initialization and the cycling loop of PESC. We describe this phase in the following subsections, covering the mathematical description of PESC and the evolutionary operators.

Mathematical description of PESC

The proposed algorithm PESC takes advantage of the ability of an MOEA to generate a set of solutions in order to find the possible nonzero entries of the similarity matrix. Assuming that each individual can identify one possible nonzero entry for each sample, the objective functions are formulated as in Eqs. (1) and (2):

$$\begin{aligned}&\min \;\left\{ {\begin{array}{*{20}{l}} {{f_1(\varvec{X})} = \frac{1}{N}\sum \limits _{i = 1}^N {\mathrm{sim}({\varvec{a}_i},{\varvec{a}_{{x_i}}})} }\\ {{f_2(\varvec{X})} = 1-\mathrm{DIV}(\varvec{X}) } \end{array}} \right. \nonumber \\&\mathrm{s.t.} \quad \varvec{X} = \{ {x_1},...,{x_i},...,{x_N}\},\nonumber \\&\qquad \; \, {x_i} \ne i,\nonumber \\&\qquad \;\, {x_i} \in \{ 1,2,...,N\}. \end{aligned}$$
(1)
$$\begin{aligned}&\mathrm{DIV}({\varvec{X}_i}) = \frac{1}{M}\sum \limits _{m = 1}^{M} {\mathrm{div}({\varvec{X}_i}} ,{\varvec{X}_m})\nonumber \\&\mathrm{div}(\varvec{X}_i, \varvec{X}_m) = \frac{1}{N}\sum \limits _{j = 1}^N {\mu \left( {x_j^i},{x_j^m}\right) } \nonumber \\&\mathrm{where} \quad \mu (c,d) = \left\{ {\begin{array}{*{20}{c}} {1,}&{}{c \ne d}\\ {0,}&{}{c = d} \end{array}} \right. \end{aligned}$$
(2)

where \(\varvec{X}=\{x_1,...,x_i,...,x_N \}\) is the decision vector to be optimized and M is the number of solutions in the current population. For any individual \(\varvec{X}=\{x_1,...,x_i,...,x_N \}\), the i-th gene \(x_i=t\) indicates that sample \(\varvec{a}_t\) connects with the i-th sample \(\varvec{a}_i\), and we denote it as sample \(\varvec{a}_{x_i}\) in the following parts. The objective function \(f_1(\varvec{X})\) measures the average similarity between samples \(\varvec{a}_i\) and \(\varvec{a}_{x_i}\), where \(i=\{1,2,...,N\}\), and \(\mathrm{sim}(\varvec{a}, \varvec{b})\) represents the similarity between samples \(\varvec{a}\) and \(\varvec{b}\). Note that different similarity measurements can be adopted in real applications; we use the Euclidean distance in this paper. \(\mathrm{DIV}(\varvec{X})\) measures the diversity of the solution \(\varvec{X}\), and \(f_2(\varvec{X})\) is a dynamic objective function since its value changes with the generation. Note that the constraint \(x_i\ne i\) must be satisfied, meaning that only the similarity between different samples is measured.
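
To make Eqs. (1) and (2) concrete, the sketch below evaluates \(f_1\) and \(f_2\) for a decision vector within a population. The use of the Euclidean distance as \(\mathrm{sim}(\cdot ,\cdot )\) follows the paper; the function and variable names are illustrative assumptions.

```python
import numpy as np

def f1(X, A):
    """f1: average Euclidean distance between each sample a_i and its chosen neighbor a_{x_i}."""
    return np.mean([np.linalg.norm(A[i] - A[X[i]]) for i in range(A.shape[0])])

def div(Xi, Xm):
    """div(X_i, X_m): fraction of genes on which the two decision vectors differ."""
    return np.mean(np.asarray(Xi) != np.asarray(Xm))

def f2(Xi, population):
    """f2 = 1 - DIV(X_i), where DIV(X_i) averages div(X_i, X_m) over the current population."""
    DIV = np.mean([div(Xi, Xm) for Xm in population])
    return 1.0 - DIV
```

Note that \(\mathrm{DIV}(\varvec{X}_i)\) is averaged over the whole current population, including \(\varvec{X}_i\) itself (which contributes \(\mathrm{div}=0\)), consistent with the worked example given in phase II.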

A variety of studies have focused on the diversity of MOEAs [6, 46], and it has been pointed out in many of them that the decision space should be taken into consideration [43] when estimating the diversity of solutions. Decision space-based diversity estimation usually significantly outperforms objective space-based strategies in solving multiobjective optimization problems, as in [34, 38, 49]. In this paper, the diversity measurement is adopted not only in the objective functions, but also in the diversity preservation strategy. According to formula (2), \(\mathrm{DIV}(\varvec{X}_i)\) measures the diversity of the solution \(\varvec{X}_i\) in the decision space, taking into account the characteristics of the similarity matrix construction problem. Theoretically, \(\mathrm{DIV}(\varvec{X}_i)\in [0,1]\). For a solution \(\varvec{X}_i\), the more it differs from the other solutions, the higher its diversity value under this estimation.

Furthermore, unsupervised clustering without any guidance is usually unreliable [32], so it is necessary to extend the above unsupervised clustering model to a semi-supervised learning model. Suppose samples \(\varvec{a}_i\) and \(\varvec{a}_j\) are labeled samples belonging to different clusters, \(\varvec{a}_i\in \varvec{C}_l\), \(\varvec{a}_j\in \varvec{C}_m\), \(l\ne m\). The semi-supervised spectral clustering can then be modeled as Eq. (3) by adding the constraints induced by these labeled samples:

$$\begin{aligned} \begin{aligned} \min \;&\left\{ {\begin{array}{*{20}{l}} {{f_1(\varvec{X})} = \frac{1}{N}\sum \limits _{i = 1}^N {\mathrm{sim}({\varvec{a}_i},{\varvec{a}_{{x_i}}})} }\\ {{f_2(\varvec{X})} = 1-\mathrm{DIV}(\varvec{X}) } \end{array}} \right. \\ \mathrm{s.t.} \quad&\varvec{X} = \{ {x_1},...,{x_i},...,{x_N}\},\\&{x_i} \ne i,\\&{x_i} \ne j,\\&{x_j} \ne i,\\&{x_i} \in \{ 1,2,...,N\}.\\ \end{aligned} \end{aligned}$$
(3)

where the constraints \(x_i\ne j\), \(x_j\ne i\) guarantee that samples with different labels do not connect with each other.
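
One simple way to keep individuals feasible with respect to Eq. (3) is a repair step applied after crossover and mutation. The repair-by-resampling strategy below is only an illustrative assumption; the paper does not prescribe how the constraints are enforced.

```python
import numpy as np

def repair(X, cannot_link, N, rng):
    """Enforce x_i != i and the cannot-link constraints x_i != j, x_j != i of Eq. (3).

    X           : decision vector of length N (modified in place)
    cannot_link : pairs (i, j) of labeled samples belonging to different clusters
    rng         : numpy random generator, e.g. np.random.default_rng()
    """
    forbidden = {i: {i} for i in range(N)}       # x_i != i
    for i, j in cannot_link:
        forbidden[i].add(j)                      # x_i != j
        forbidden[j].add(i)                      # x_j != i
    for i in range(N):
        while X[i] in forbidden[i]:
            X[i] = rng.integers(0, N)            # resample until feasible
    return X
```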

Evolutionary operators

In accordance with the objective functions, we use a graph-based representation encoding scheme in PESC. The encoding length is N, and the size of the search space is \((N-1)^N \). To obtain a set of high-quality initial individuals, a specific initialization scheme is designed (see Algorithm 3). This scheme is based on the assumption that the corresponding entry of the similarity matrix is likely to be a nonzero entry if two samples lie in each other's neighborhood. In each individual \(\varvec{X}_i=\{x^i_1,x^i_2,...,x^i_j,...x^i_N\}\), the value of \(x_j^i\) is decided by the Euclidean distance between sample \(\varvec{a}_j\) and the other samples. In Algorithm 3, the parameter k controls the width of the neighborhood and affects the convergence speed of PESC; it should be larger than log(N) according to [44].

[Algorithm 3: The neighbor-biased initialization scheme of PESC]
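
A minimal sketch of this neighbor-biased initialization is given below. Since Algorithm 3 is not reproduced here, the concrete sampling rule (each gene \(x_j\) is drawn uniformly from the k nearest neighbors of \(\varvec{a}_j\)) is our reading of the scheme and should be taken as an assumption.

```python
import numpy as np

def initialize_population(A, pop, k, rng):
    """Neighbor-biased initialization (cf. Algorithm 3, illustrative sketch)."""
    N = A.shape[0]
    dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # enforce x_j != j
    knn = np.argsort(dist, axis=1)[:, :k]          # k nearest neighbors of each sample
    population = np.empty((pop, N), dtype=int)
    for p in range(pop):
        for j in range(N):
            population[p, j] = rng.choice(knn[j])  # pick a random near neighbor
    return population

# Usage sketch:
# population = initialize_population(A, pop=100, k=int(np.ceil(np.log(len(A)))) + 1,
#                                    rng=np.random.default_rng(0))
```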

The adopted crossover and mutation operators are extensions of the classic SBX (simulated binary crossover) and polynomial mutation operators, respectively. Both of them contribute to the effectiveness of PESC, as will be demonstrated in the experiment section.

Phase II: Pareto ensemble based weight determination

When the cycling loop finishes, we obtain a set of nondominated solutions \(\{\varvec{X}_1,...,\varvec{X}_m,...,\varvec{X}_M \}\), where \(\varvec{X}_m=\{x_1^m,x_2^m,...,x_N^m\}\). The existing Pareto ensemble methods introduced in “Introduction” are not suitable for PESC since they are based on the label or prototype representations rather than the graph-based representation encoding scheme. Besides, the objective functions adopted in PESC make it inappropriate to use those cluster ensemble schemes. In this section, we focus on how to convert these nondominated solutions into a sparse similarity matrix with the proposed methods.

In general, when we convert \(\{\varvec{X}_1,...,\varvec{X}_m,...,\varvec{X}_M \}\) into a sparse similarity matrix \(\varvec{S}\), the following rule should be satisfied: if \(x_i^m=j\) for some nondominated solution m, then \(s_{ij}\) is a nonzero entry; otherwise, \(s_{ij}\) is a zero entry.

For example, if we have a population of 3 individuals \(\varvec{X}_1=\{2\ 1\ 4\ 3\ 4\}\), \(\varvec{X}_2=\{2\ 1\ 5\ 3\ 4\}\) and \(\varvec{X}_3=\{2\ 3\ 5\ 3\ 3\}\), then the nonzero entries of the converted similarity matrix are \(\{s_{12}, s_{21}, s_{23}, s_{34}, s_{35}, s_{43}, s_{53}, s_{54}\}\).
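
This conversion rule amounts to only a few lines of code. The sketch below reproduces the worked example above (with 0-based instead of 1-based indices); the function name is an illustrative assumption.

```python
def nonzero_entries(solutions):
    """Collect the positions (i, j) of S that become nonzero entries."""
    entries = set()
    for X in solutions:
        for i, j in enumerate(X):   # x_i = j means s_ij is nonzero
            entries.add((i, j))
    return entries

# Worked example from the text, converted to 0-based indices
X1, X2, X3 = [1, 0, 3, 2, 3], [1, 0, 4, 2, 3], [1, 2, 4, 2, 2]
print(sorted(nonzero_entries([X1, X2, X3])))
# -> [(0, 1), (1, 0), (1, 2), (2, 3), (2, 4), (3, 2), (4, 2), (4, 3)]
# i.e. s_12, s_21, s_23, s_34, s_35, s_43, s_53, s_54 in the 1-based notation of the text
```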

However, the values of the nonzero entries cannot be determined by phase I, so we design the following three methods to complete this task.

  1.

    Sparse representation based single-objective optimization (denoted as ES1 in the following experiment section):

    $$\begin{aligned}&\min \quad \Vert \varvec{A} \varvec{S}-\varvec{A}\Vert ^2_2\nonumber \\&\mathrm{s.t.}\quad s_{ij}>0,\quad \mathrm{if}\ x_i^m=j,\nonumber \\&\qquad \;\, s_{ij}=0,\quad \mathrm{else}. \end{aligned}$$
    (4)

This strategy is based on sparse representation, inspired by our previous work [20]; it tries to minimize the reconstruction error \(\Vert \varvec{A} \varvec{S}-\varvec{A}\Vert ^2_2\), where \(\varvec{A}\) is not only the over-complete dictionary but also the measurement matrix. When we use this strategy to obtain the similarity matrix \(\varvec{S}\), the positions of the nonzero entries have already been decided according to the previous rule, as implied by the constraints, so the remaining task is to determine the weight of each nonzero entry. Many evolutionary algorithms can be adopted here, such as particle swarm optimization (PSO) and the genetic algorithm (GA). In our experiment, we adopt a simplified PSO to complete this task because of its easy implementation and fast convergence.

  2.

    Diversity ensemble (denoted as ES2 in the following experiment section):

    $$\begin{aligned}&{s_{ij}} = \frac{1}{M}\sum \limits _{m = 1}^M (1-DIV(\varvec{X}_m)) \cdot \delta \left( x_i^m, j\right) \nonumber \\&\delta \left( x_i^m, j\right) = \left\{ {\begin{array}{*{20}{l}} {1,}&{}\quad {x_i^m = j}\\ {0,}&{}\quad {x_i^m \ne j} \end{array}} \right. \end{aligned}$$
    (5)
    Table 1 The value of \(div(\varvec{X}_i,\varvec{X}_m)\)

where M is the number of nondominated solutions. This similarity construction strategy is related to the diversity of each nondominated solution. Suppose \(s_{ij}\) is a nonzero entry; to determine its value, we first find the nondominated solutions for which \(x_i^m=j\) and calculate their diversity according to (2). Metaphorically speaking, \(\delta (x_i^m, j)\) is in charge of finding the solutions that agree to connect sample \(\varvec{a}_i\) with \(\varvec{a}_j\), and each such solution \(\varvec{X}_m\) contributes a weight proportional to \(1-\mathrm{DIV}(\varvec{X}_m)\) to \(s_{ij}\). Note that when we construct \(\varvec{S}\), only the nondominated solutions are considered (a code sketch of this computation is given after this list). Taking the above \(\varvec{X}_1\), \(\varvec{X}_2\), and \(\varvec{X}_3\) as an example, the values of \(div(\varvec{X}_i,\varvec{X}_m)\) are calculated according to formula (2) and shown in Table 1, and the diversity values \(DIV(\varvec{X}_1)\), \(DIV(\varvec{X}_2)\), \(DIV(\varvec{X}_3)\) are \(\frac{4}{15}\), \(\frac{1}{5}\), \(\frac{1}{3}\), respectively. Therefore, the value of the nonzero entry \(s_{12}\) is determined by individuals \(\varvec{X}_1\), \(\varvec{X}_2\), and \(\varvec{X}_3\) in accordance with formula (5), and \(s_{12}=\frac{1}{3}((1-DIV(\varvec{X}_1))+(1-DIV(\varvec{X}_2))+(1-DIV(\varvec{X}_3)))=\frac{11}{15}\). In the same way, the whole similarity matrix can be obtained, and \(S= \begin{pmatrix} 0 & \frac{11}{15} & 0 & 0 & 0\\ \frac{22}{45} & 0 & \frac{11}{45} & 0 & 0 \\ 0 & 0 & 0 & \frac{11}{45} & \frac{22}{45}\\ 0 & 0 & \frac{11}{15} & 0 & 0\\ 0 & 0 & \frac{22}{45} & \frac{11}{45} & 0 \end{pmatrix} \)

  3.

    Kernel function (denoted as ES3 in the following experiment section):

    $$\begin{aligned} {s_{ij}} = \exp \left( \frac{{ - \mathrm{dist}_{ij}^2}}{{2{\sigma ^2}}}\right) \end{aligned}$$
    (6)

    where \(\mathrm{dist}_{ij}\) is the Euclidean distance between samples \(\varvec{a}_i\) and \(\varvec{a}_j\), and \(\sigma \) is the scale factor that controls the width of the neighborhood.

Comparing the performance of the above strategies, all of them can obtain satisfying results, but ES1 is time-consuming compared with the others, and ES3 is easy to implement but requires a suitable value of the scale factor for each clustering problem. ES2 achieves a balance between time complexity and clustering accuracy. The effect of these three ensemble strategies will be discussed in the experiment section.
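
For concreteness, the sketch below implements the ES2 computation of Eq. (5) and reproduces \(s_{12}=\frac{11}{15}\) from the worked example above (again with 0-based indices); all naming and structuring choices are illustrative assumptions.

```python
import numpy as np

def div(Xi, Xm):
    """div(X_i, X_m): fraction of genes on which two decision vectors differ (Eq. (2))."""
    return np.mean(np.asarray(Xi) != np.asarray(Xm))

def es2_similarity(solutions, N):
    """ES2 (diversity ensemble): build S from the nondominated solutions via Eq. (5)."""
    M = len(solutions)
    # DIV(X_m): average difference of X_m against the whole set, including itself
    DIV = [np.mean([div(Xm, Xl) for Xl in solutions]) for Xm in solutions]
    S = np.zeros((N, N))
    for m, X in enumerate(solutions):
        for i, j in enumerate(X):
            S[i, j] += (1.0 - DIV[m]) / M   # each agreeing solution adds (1 - DIV)/M
    return S

# Worked example: s_12 (S[0, 1] with 0-based indices) equals 11/15
X1, X2, X3 = [1, 0, 3, 2, 3], [1, 0, 4, 2, 3], [1, 2, 4, 2, 2]
print(es2_similarity([X1, X2, X3], N=5)[0, 1])   # 0.7333...
```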

According to Step 2 of the spectral clustering procedure in Algorithm 1, the similarity matrix has to be transformed into a Laplacian matrix \(\varvec{L}\), which is symmetric and positive semi-definite. The following transformation, as in [20], is used to complete this step:

$$\begin{aligned} \begin{array}{l} s_{ij}=\max \{s_{ij},s_{ji}\}\\ {d_{ij}} = \left\{ {\begin{array}{*{20}{l}} {0,}&{}\quad {i \ne j}\\ {\sum \limits _{m = 1}^N {{s_{im}}} },&{}\quad {i = j} \end{array}} \right. \\ \varvec{L}=\varvec{D}-\varvec{S} \end{array} \end{aligned}$$
(7)

\(\varvec{D}\in {\mathbb {R}}^{N\times N}\) is a diagonal matrix whose diagonal element \(d_{ii}\) equals the sum of the i-th row of the weight matrix \(\varvec{S}\) (equivalently, its i-th column, since \(\varvec{S}\) is symmetric after the first step).

Table 2 The characteristics of the adopted UCI datasets
Fig. 1 The effect of the parameters \(p_c\) and \(p_m\) (\(p_m=1-p_c\)) on the clustering accuracy over 30 runs

Complexity analysis

  1.

    Space complexity  Part of the memory in our algorithm is used to store the distance matrix among all the samples, whose space complexity is \(O(N^2)\). The rest is mainly used to store the population, whose space complexity is \(O(pop\cdot N)\).

  2.

    Time complexity  In this algorithm, the main time cost lies in the working cycle of the MOEA. The time complexity of the initialization, crossover, mutation, and evaluation is \(O(pop\cdot N)\) each. The time complexity of the diversity preservation strategy is \(O(n_{div}\cdot N \cdot pop)\), where \(n_{div}\) is the number of individuals involved in the diversity preservation strategy. The time complexity of updating each generation depends on the MOEA adopted; in our experiments, PESC is carried out on the basis of NSGA-II, for which this step is \(O((2pop)^2)\). The time complexity of Step 4 (Algorithm 1) also depends on the method adopted to compute the first K eigenvectors. Hence, the total time complexity of PESC simplifies to \(\max \{O(pop\cdot gen\cdot N), O(N^2)\}\).

Experiments and discussion

PESC is mainly carried out on the basis of NSGA-II. In this section, we show the experimental results and discuss them, including the investigation of the parameters, the overall comparison between PESC and traditional or multiobjective clustering algorithms, and a detailed discussion of PESC.

The investigation of parameters

In the experiment section, 11 UCI supervised classification datasets are adopted to test the efficiency of PESC. The number of clusters is fixed to the real number of classes, the clustering accuracy is measured as the percentage of instances that are correctly classified, and all the results are obtained from 30 independent runs. The main characteristics of these datasets are shown in Table 2, from which we can see that the number of samples ranges from 150 to 5000. We take ‘wine’, ‘balance’, and ‘waveform’ as examples to investigate the parameters \(p_c\), \(p_m\), pop, and gen, as shown in Figs. 1, 2 and 3.

Figure 1 shows the clustering accuracy against the parameter \(p_c\), varied from 0.5 to 0.9 with an interval of 0.1, on the datasets “wine”, “balance”, and “waveform”. We can observe that with \(p_c=0.9\), all three tested datasets achieve relatively high clustering accuracy.

Fig. 2 The effect of parameter pop on the clustering accuracy over 30 runs

Regarding the effect of the parameter pop, the scale of the dataset is a major factor to consider. When the other parameters are fixed (\(p_c=0.9\), the number of evaluations is set to 20,000), the effect of pop on the clustering accuracy, with pop ranging from 10 to 100, is shown in Fig. 2. It can be observed from this figure that the appropriate value of pop is related to the scale of the dataset: to obtain a satisfying clustering result, the larger the dataset, the higher the value of pop usually needs to be. According to [44], the number of neighbors k in the kNN similarity construction method is set to log(N) (N is the number of samples). In contrast with kNN, pop has a similar effect in controlling the size of the neighborhood, but the accuracy of PESC is insensitive to this parameter when its value is large enough, as observed above. When pop is set to a large enough value, the width of the neighborhood is determined adaptively during the evolutionary process. According to the theoretical analysis and the experimental results, the value of pop should be larger than log(N); in our following experiments, it is set to no less than \(\sqrt{N}\) to obtain a satisfying result. Taking the scale of the tested datasets into consideration, pop is set to 100 in our experiments (see Fig. 2).

Fig. 3 The effect of parameter gen on the clustering accuracy over 30 runs

Figure 3 shows the clustering accuracy with respect to the parameter gen. In this figure, \(p_c\)=0.9, pop=100, gen is set to its maximum value 100, and the clustering accuracy is recorded every 5 generations from 0 to 100. From Fig. 3, we can see that PESC has converged on the tested datasets by the 20th generation.

In general, taking the stability and time complexity of PESC into consideration, the parameters pop, gen, \(p_c\), and \(p_m\) (\(p_m=1- p_c\)) are set to 100, 100, 0.9, and 0.1, respectively.

The effect of evolutionary operators

As described in “Evolutionary operators”, we designed a specific initialization strategy biased toward neighborhood samples. To test its efficiency, we compare the proposed initialization strategy with random initialization (see Fig. 4). In the random initialization strategy, \(x_j^i\) (the i-th individual is represented as \(\varvec{X}_i=\{x_1^i,x_2^i,...,x_j^i,...,x_N^i\}\)) is set to a random value selected from 1 to N. In Fig. 4, gen is set to 300 and the clustering accuracy is recorded every five generations; we can see that the convergence speed of the random initialization scheme is low, especially for large-scale datasets. Analyzing Figs. 2 and 4, we can conclude that the proposed initialization scheme generates a set of high-quality solutions and speeds up the convergence.

Fig. 4 The convergence analysis of the random initialization strategy

The similarity matrix comparison

Fig. 5 Visualization of the similarity matrix (column 1), eigenvalues (column 2), and eigenvectors (columns 3–5) obtained from PESC against other algorithms on the dataset ‘wine’

In spectral clustering, the similarity matrix can reveal the relationships between samples. To evaluate the similarity matrix obtained from PESC, a visualization of the similarity matrix and the corresponding eigenvalues and eigenvectors is shown in Fig. 5. In this experiment, we take ‘wine’ for illustration since the relationship between its clusters is very clear (‘wine’ has 178 samples, 13 attributes, and three categories, with samples 1–59 belonging to category ‘1’, samples 60–130 to category ‘2’, and the rest to category ‘3’). In Fig. 5, we compare PESC with the sparse spectral clustering algorithm SRMOSC and the traditional similarity matrix construction methods to discuss their performance. In the similarity matrix, the maximum and minimum weights are shown as white and black pixels, respectively, for visualization purposes.

Comparing the three ensemble strategies proposed in PESC, we can see that they achieve similar performance, but their similarity matrices are quite different. The weights obtained by ES1 are lower than those obtained by ES2 and ES3. However, they share the properties that the number of nonzero entries is smaller than the number of zero entries and that the nonzero entries are mostly distributed as intra-class connections.

In contrast with SRMOSC, ES1 of PESC obtains a similar similarity matrix. Both similarity matrices are sparse, and the values of their nonzero entries are lower than those obtained from the other similarity matrix construction methods. What is more, the time and space complexity of PESC are lower than those of SRMOSC, as shown in the corresponding experiment in the next section.

In contrast with the conventional similarity construction methods (the fully connected, kNN, mutual kNN, and \(\epsilon \)-neighborhood similarity matrix construction methods), the proposed algorithm PESC (rows 1–3) and kNN (row 6) perform better, while the remaining algorithms cannot extract distinguishable information from the eigenvectors to carry out the clustering task exactly. We can also see that the eigenvectors of the fully connected, mutual kNN, and \(\epsilon \)-neighborhood similarity matrices cannot be used to classify the samples into the correct clusters. Furthermore, it is a difficult and time-consuming task to decide the parameter values in the conventional similarity construction methods, whereas PESC can determine the neighborhood of a sample automatically during the evolutionary process.

In general, we can draw the following conclusions from Fig. 5: (1) the similarity matrices obtained from the three proposed ensemble strategies are sparse, and all of them can achieve satisfying clustering results; (2) the nonzero entries are mainly distributed as intra-class connections if the similarity matrix provides effective distinguishable information.

Table 3 Clustering accuracy comparison obtained from PESC against MOEA-based clustering algorithms, including cluster ensemble algorithms, on real-life datasets
Table 4 Clustering accuracy comparison obtained from PESC against conventional similarity construction-based algorithms on real-life datasets
Fig. 6 Time evaluation of PESC against other algorithms

Experimental comparison between PESC and other algorithms

In this section, we carry out an overall comparison between PESC and other algorithms, including multiobjective clustering algorithms (SRMOSC and MOCK), other cluster ensemble strategies or algorithms (\(MOGA(\pi ,sep)\) and CSPA), and conventional similarity matrix construction methods. Among these comparison algorithms, \(MOGA(\pi ,sep)\), CSPA, MOCK, and SRMOSC are all MOEA based clustering algorithms; MOCK can also be regarded as a multiobjective cluster-ensemble algorithm since its initialization is a hybrid of link-based clustering and k-means. Furthermore, the experiments are also extended to semi-supervised clustering with 10% and 20% labeled samples. All these experiments are conducted on 11 UCI datasets over 30 independent runs to demonstrate the effectiveness of PESC.

The parameter setting of the comparison algorithms

The values of pop, gen, \(p_c\), and \(p_m\) are set to 100, 100, 0.7, and 0.3 in SRMOSC and MOCK according to [14, 20]. CSPA is carried out on the basis of a multiobjective genetic algorithm (MOGA) and shares the same objective functions with \(MOGA(\pi , sep)\), so the parameter settings of these two algorithms are the same: pop, gen, \(p_c\), and \(p_m\) are set to 100, 100, 0.8, and 0.2, respectively [24]. When constructing the similarity matrix using the fully connected, kNN, and mutual kNN construction methods, the Gaussian kernel \(K(\varvec{x},\varvec{y})=\mathrm{e}^{-\frac{||\varvec{x}-\varvec{y}||^2}{2\sigma ^2}}\) is adopted to calculate the similarity. We run the experiment with the values \(\{0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15\}\) for \(\sigma \) and choose the one that gives the best clustering result as the final value. For the kNN and mutual kNN methods, k is set to log(N). For the \(\epsilon \)-neighborhood method, \(\epsilon \) is selected from the values {0.2, 0.3, 0.4, 0.5, 0.6} as the one that gives the best clustering result.

The comparison results of unsupervised clustering algorithms

Tables 3 and 4 show the results obtained by PESC against other MOEA based clustering algorithms and against the conventional similarity matrix construction algorithms on real-life datasets with no labeled samples, respectively. The best results in these tables are written in bold. In Table 3, \(MOGA(\pi ,sep)\) and CSPA are MOEA based cluster ensemble algorithms. We can see that all three Pareto ensemble strategies proposed in PESC perform well and outperform the other cluster ensemble algorithms, which provides empirical support for the proposed Pareto ensemble strategies on the evaluated problems. In contrast with the sparse representation based multiobjective spectral clustering algorithm SRMOSC, PESC works better on most datasets. Although PESC does not improve the clustering accuracy dramatically over SRMOSC, it is much more time-efficient. Comparing PESC with the multiobjective clustering algorithm MOCK, we can see that PESC outperforms MOCK on most of the tested datasets.

In addition, when we compare PESC with the conventional similarity matrix based spectral clustering algorithms, PESC outperforms the fully connected, mutual kNN, and \(\epsilon \)-neighborhood methods on all the tested datasets significantly, and performs better than kNN. In contrast with these traditional algorithms, PESC overcomes the difficulty of deciding the width of the neighborhood, which is controlled by the parameter values in the conventional methods, such as k in kNN or mutual kNN, \(\epsilon \) in the \(\epsilon \)-neighborhood method, and the scale factor \(\sigma \) in the fully connected similarity matrix construction method.

Table 5 Semi-supervised clustering with 10% labelled data obtained from PESC against other algorithms on real-life datasets
Table 6 Semi-supervised clustering with 20% labelled data obtained from PESC against other algorithms on real-life datasets

Taking the time cost of all the algorithms into consideration, we give a time comparison between PESC and the other algorithms (see Fig. 6) under the same experimental conditions. In this figure, the vertical axis represents the time cost, plotted on a logarithmic scale, and the three proposed ensemble strategies are denoted by “PESC_ES1”, “PESC_ES2”, and “PESC_ES3”, respectively. Considering the clustering accuracy as well, we can draw the following conclusions from this experiment: (1) the three proposed ensemble strategies obtain similar clustering accuracy; however, ES1 is the most time-consuming of the three since it adopts a simplified PSO algorithm to solve the optimization problem, so we prefer ES2 as the ensemble strategy considering both clustering accuracy and time complexity; (2) in contrast with the other MOEA based clustering algorithms (including SRMOSC, MOCK, \(MOGA(\pi ,sep)\), and CSPA), the time cost of PESC is comparable with that of \(MOGA(\pi ,sep)\) and CSPA, and considerably lower than that of SRMOSC and MOCK, which demonstrates that the Pareto ensemble strategies bring improvements not only in time complexity, which is a burden in many MOEA clustering algorithms, but also in clustering accuracy; (3) when we compare PESC with the traditional similarity matrix construction methods, although the time complexity of PESC is higher, it overcomes the difficulty of determining the parameter values when constructing the similarity matrix. In general, we can conclude that PESC strikes a balance between time complexity and clustering accuracy.

The comparison results of semi-supervised clustering algorithms

Tables 5 and 6 show the comparison results between PESC and other semi-supervised clustering algorithms with 10% and 20% labeled samples, respectively. SRMOSC transforms the semi-supervised information into constraints to be satisfied. Semi-MOCK handles semi-supervised information with an objective function called the ‘Adjusted Rand Index’, which is an external measure of clustering quality. In the traditional similarity matrix construction methods, all the entries between labelled samples with the same label are set to the maximum value of the similarity matrix, and the entries between labelled samples with different labels are set to 0. From these tables, we can see that PESC, no matter which ensemble strategy is adopted, achieves better performance than the other multiobjective clustering algorithms on most of the tested datasets.

Conclusion

Multiobjective evolutionary algorithms have attracted growing attention in the field of data mining according to the recent literature. A lot of related work has been proposed in the last decades, but little research has focused on Pareto ensembles or spectral clustering. In this paper, we introduce a multiobjective evolutionary algorithm into spectral clustering and propose a Pareto ensemble based framework for spectral clustering. The main contributions can be summarized as follows.

First, and most important, we model the similarity matrix construction task as a Pareto ensemble problem, which not only overcomes the difficulty of determining the parameter values in the conventional methods, but is also more time-efficient than the existing multiobjective spectral clustering algorithm.

Second, to solve the proposed model effectively, we design a specific diversity preservation strategy and also employ it in one of the proposed ensemble strategies. The experimental results show that this diversity preservation strategy is effective and more time-efficient than the other ensemble strategies. In addition, a specific neighbor-biased initialization strategy is proposed; it helps to improve the convergence speed and to reduce the amount of computation during the evolutionary process.

Furthermore, we extend this model to semi-supervised clustering by transforming the semi-supervised information into constraints of the MOP, and detailed experiments show its efficiency in handling the related problems.

Finally, PESC mainly focuses on the construction of the similarity matrix, so it can easily be extended to related fields, such as subspace learning.

In general, PESC obtains satisfying results not only in clustering accuracy, but also in time and space complexity in contrast with other multiobjective clustering algorithms. The goal of striking a balance between time cost and clustering accuracy is achieved by employing an MOEA and ensemble strategies in this paper.