Abstract

The similarity graphs of most spectral clustering algorithms carry a large amount of incorrect community information. In this paper, we propose a probability matrix and a novel improved spectral clustering algorithm based on it for community detection. First, a Markov chain is used to calculate the transition probabilities between nodes, and the probability matrix is constructed from these transition probabilities. Then, the similarity graph is constructed from the mean probability matrix. Finally, community detection is achieved by optimizing the NCut objective function. The proposed algorithm is compared with SC, WT, FG, FluidC, and SCRW on artificial and real networks. Experimental results show that it detects communities more accurately and has better clustering performance.

1. Introduction

With the development of information technology, the interactions within complex systems in biology, sociology, and other fields are becoming increasingly close. It is of great theoretical significance and practical value to obtain relevant information from real complex systems. According to graph theory, most real complex systems whose internal entities have rich associations can be abstracted into complex networks, such as neural networks, power networks, and social networks. In addition to the small-world and scale-free properties, complex networks have an extremely important community structure [1]. A community is a mesoscopic structure in which nodes in the same community are closely connected while nodes in different communities are sparsely connected. It plays an important role in revealing the topological structure and functional features of complex networks. In recent years, community detection has become a popular research field for searching information, analysing function, and forecasting behaviour.

Community detection is the process of dividing a network into clusters according to certain relationships among nodes; it classifies nodes based on the topological structure of the network. It can reveal the hidden hierarchical structure of a real network and improve the performance and efficiency of storing, processing, and analysing network data. Many methods exist for community detection, such as the spectral bisection algorithm [2], graph segmentation algorithms [3], heuristic algorithms [4], and objective optimization algorithms [5].

Community detection is an important branch of complex network research. Among traditional community detection algorithms, the most famous is the spectral analysis algorithm based on network topology, referred to below as spectral clustering [2]. Its main idea is to eigen-decompose the similarity matrix of the network and use the leading eigenvectors to find communities. Spectral clustering is not only applicable to a variety of data structures but also uses dimensionality reduction to lower computational complexity. Consequently, scholars began to study, optimize, and extend spectral clustering.

Qin et al. [6] proposed a multisimilarity spectral method for clustering dynamic networks. It detects communities by bootstrapping the clustering of different similarity measures. Ulzii and Sanggil [7] designed an agglomerative spectral clustering method with conductance and edge weights. The most similar nodes are agglomerated based on eigenvector space and edge weights. Ding et al. [8] explored the equivalence relation between the nonnegative matrix factorization and spectral clustering and developed a semisupervised spectral clustering algorithm.

Spectral clustering typically constructs the similarity matrix from the Euclidean distance between nodes. However, Euclidean distance may lose hidden relationships among nodes, so the similarity matrix cannot contain complete community information, and clustering performance is unsatisfactory. The closer the constructed similarity matrix is to the ideal matrix, the better the spectral clustering algorithm performs. Hence, constructing a good similarity matrix is the key to spectral clustering community detection.

Nataliani and Yang [9] proposed a new affinity matrix generation method using the neighbour relation propagation principle. The method increases the similarity of point pairs that should be in the same cluster, but the distance threshold is easily affected by outliers or noise points. Beauchemin [10] presented a method to build affinity matrices from a density estimator relying on K-means with a subbagging procedure; however, this method does not work well when manifold proximity exists. Zhang and You [11] developed an approach based on random walks to process the similarity matrix, in which the pairwise similarity depends not only on the two points but also on their neighbours. However, the threshold for neighbouring nodes is set manually, and the clustering stability is poor.

Although many community detection algorithms based on optimizing the similarity graph have been proposed, how to construct a similarity graph that correctly reflects the community structure remains unsolved. Consequently, this paper uses the transition probabilities between nodes to calculate similarity, presents the concept of the probability matrix, and proposes an improved spectral clustering community detection algorithm based on it.

3. Improved Spectral Clustering Algorithm

3.1. Constructing a Similarity Graph by Probability Matrix

The similarity graph of spectral clustering is constructed by calculating the similarity between nodes. In this section, the similarity between nodes is calculated from the transition probabilities among nodes, and the related concepts of the probability matrix and the mean probability matrix are introduced. The similarity graph is then constructed from the mean probability matrix.

3.1.1. Transition Probability

A Markov chain is a stochastic process with the Markov property, describing a sequence of states: the state changes over time, and the next state depends only on the current state [12]. The possibility of transition between states is called the transition probability.

Given a network N with n nodes and adjacency matrix W, the probability that node i reaches node j in one step is the 1st transition probability, which can be defined as

pr_ij = w_ij / Σ_k w_ik.  (1)

The 1st transition matrix Pr is the matrix composed of the entries pr_ij, so Pr = D_W⁻¹ · W, where D_W = diag(d_0, d_1, …, d_{n−1}) and d_i = Σ_j w_ij.
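As a concrete illustration (not the authors' code), the 1st transition matrix can be computed from an adjacency matrix with a few lines of NumPy; the 4-node toy network below is our own example:

```python
import numpy as np

# Toy undirected network: adjacency matrix W (illustrative, 4 nodes)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = W.sum(axis=1)        # degrees d_i = sum_j w_ij (the diagonal of D_W)
Pr = W / d[:, None]      # row-normalise: pr_ij = w_ij / d_i, i.e. Pr = D_W^-1 W

# Every row of Pr is a probability distribution over the node's neighbours.
```

Each row sums to 1, so Pr is row-stochastic, as required of a Markov transition matrix.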

The probability that node i reaches node j after l steps is the l-th transition probability, and the matrix formed by these probabilities is called the l-th transition matrix Pr_l. According to the properties of the Markov chain, we get

Pr_l = Pr^l,  (2)

that is, Pr_l is the l-th matrix power of Pr.

3.1.2. Probability Matrix

Definition 1. Given a network N(V, E), let the transition probability from node i to node j be p_ij. The probability matrix of N is then the |V| × |V| matrix composed of the p_ij, denoted P = (p_ij).
The probability matrix describes the transition probabilities between nodes in the network. The 1st transition probability reflects the most direct relationship between a node and its adjacent nodes but misses hidden relationships with nonadjacent nodes. Multistep transition probabilities involve more neighbours and capture the multiple complex associations among nodes; however, a multistep walk may fail to reach the adjacent nodes, which could weaken the relationship with them. Consequently, we propose constructing the probability matrix as an accumulation of weighted multistep transition matrices:

P = Σ_{i=1}^{t} α_i · Pr_i,  (3)

where t refers to the size of the Markov chain, called the time scale, Pr_i is the i-th transition matrix, and α_i represents the weight of Pr_i, with α_i ∈ [0, 1] and Σ_{i=1}^{t} α_i = 1.
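A minimal sketch of this weighted accumulation, with illustrative weights (0.5, 0.3, 0.2) that sum to 1; the function name and the toy transition matrix are our own assumptions, not part of the original algorithm description:

```python
import numpy as np

def probability_matrix(Pr, alphas):
    """Weighted accumulation of transition-matrix powers, as in equation (3)."""
    P = np.zeros_like(Pr)
    power = np.eye(len(Pr))
    for a in alphas:
        power = power @ Pr          # Pr^i, built incrementally
        P = P + a * power
    return P

# Toy 1-step transition matrix of a small 4-node network
Pr = np.array([[0.0, 0.5, 0.5, 0.0],
               [0.5, 0.0, 0.5, 0.0],
               [1/3, 1/3, 0.0, 1/3],
               [0.0, 0.0, 1.0, 0.0]])
alphas = [0.5, 0.3, 0.2]            # illustrative weights summing to 1
P = probability_matrix(Pr, alphas)
```

Because each Pr^i is row-stochastic and the weights sum to 1, P is itself row-stochastic.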

3.1.3. Mean Probability Matrix

The time scale is the key to calculating the similarity of nodes, but the optimal time scale differs from network to network, so using a fixed time scale can introduce errors. To reduce the influence of the parameter t and the weights, we propose the mean probability matrix, obtained by averaging P over different time scales.

Definition 2. Given a network N(V, E), let its probability matrix be P and its time scale be t. The mean probability matrix is the |V| × |V| matrix obtained by averaging P_1, P_2, …, P_t, where P_i denotes the probability matrix computed with time scale i. According to equations (2) and (3), the mean probability matrix PM is

PM = (1/t) · Σ_{i=1}^{t} P_i.  (4)

Not only does the time scale t of PM provide the time scale for each P_i, but more importantly, it specifies the number of probability matrices being summed. Averaging different P_i smooths out the error caused by particular choices of t and the weights, so different values of t do not cause large errors. As a result, t can be chosen fairly freely; to limit computational complexity, we set t to a value in [5, 13].
The weight α_{i,j} represents the weight of the j-th transition matrix in P_i. According to Definition 2, as i gradually changes, the number of corresponding weights also gradually changes. To satisfy the constraint Σ_{j=1}^{i} α_{i,j} = 1, α_{i,j} is defined as

α_{i,j} = ws_j / Σ_{s=1}^{i} ws_s,  (5)

where ws is a set of weights of size t, set manually and satisfying ws_1 > ws_2 ≥ … ≥ ws_t.
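The two definitions above can be sketched as follows. This is our reading of equations (4) and (5), renormalising the first i entries of ws for each P_i; it is not the authors' reference implementation:

```python
import numpy as np

def mean_probability_matrix(Pr, ws):
    """Average of the probability matrices P_1 .. P_t (Definition 2)."""
    t, n = len(ws), len(Pr)
    # Precompute Pr^1 .. Pr^t
    powers, M = [], np.eye(n)
    for _ in range(t):
        M = M @ Pr
        powers.append(M.copy())
    # P_i uses the first i weights of ws, renormalised to sum to 1 (eq. (5))
    PM = np.zeros((n, n))
    for i in range(1, t + 1):
        alphas = np.array(ws[:i], dtype=float)
        alphas /= alphas.sum()
        PM += sum(a * p for a, p in zip(alphas, powers))
    return PM / t

# Toy transition matrix and a decreasing weight set of size t = 5
Pr = np.array([[0.0, 0.5, 0.5, 0.0],
               [0.5, 0.0, 0.5, 0.0],
               [1/3, 1/3, 0.0, 1/3],
               [0.0, 0.0, 1.0, 0.0]])
ws = [5, 4, 3, 2, 1]
PM = mean_probability_matrix(Pr, ws)
```

Each P_i is row-stochastic, so their average PM is as well, regardless of the chosen t.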

3.2. Improved Spectral Clustering Algorithm Based on Mean Probability Matrix
3.2.1. Constructing the New Similarity Graph

The similarity matrix WP is constructed from the mean probability matrix PM. Given a network N with mean probability matrix PM, the similarity between nodes i and j can be defined as

wp_ij = pm_ij for i ≠ j, and wp_ii = 0,  (6)

where wp_ij denotes the (i, j)-th entry of WP and pm_ij the (i, j)-th entry of PM.

The similarity matrix of traditional spectral clustering is symmetric, which makes it straightforward to compute the Laplacian matrix L. Although WP is not symmetric, it has special properties that still allow L to be constructed:

(1) L_W = D − WP, where D is a diagonal matrix with entries d_i = Σ_j wp_ij, so its diagonal entries are positive. WP is a matrix with nonnegative entries, its diagonal entries are all 0, and no row is entirely 0. It follows that every diagonal element of L_W is positive, every off-diagonal element is nonpositive, and each row of L_W sums to 0.

(2) For any vector f, L_W satisfies

f^T · L_W · f = (1/2) · Σ_{i,j} wp_ij · (f_i − f_j)² ≥ 0.

As a result, L_W is a valid Laplacian matrix, and WP can be used to construct the similarity graph for spectral clustering.
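A short sketch of this construction, with an illustrative (and deliberately asymmetric) similarity matrix WP of our own:

```python
import numpy as np

def laplacian(WP):
    """Unnormalised Laplacian L_W = D - WP, with d_i = sum_j wp_ij."""
    d = WP.sum(axis=1)
    return np.diag(d) - WP

# Illustrative asymmetric similarity matrix: nonnegative, zero diagonal
WP = np.array([[0.0, 0.6, 0.4],
               [0.6, 0.0, 0.4],
               [0.5, 0.5, 0.0]])
LW = laplacian(WP)
```

The structural properties from Section 3.2.1 are easy to check numerically: positive diagonal, nonpositive off-diagonal entries, and zero row sums.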

3.2.2. NCut Objective Function

Spectral clustering has many different objective functions. The purpose of the objective functions is to find a partition of the network such that the edges between different communities have lower weight and the edges within the same community have a higher weight. In other words, nodes in different clusters are dissimilar from each other, and nodes within the same cluster are similar to each other.

The more popular functions are RatioCut [13] and NCut [14]. RatioCut normalizes by the number of nodes in each community, while NCut normalizes by the weight within each community. Given a network N(V, E), they can be defined as

RatioCut(C_1, …, C_K) = Σ_{k=1}^{K} cut(C_k, C̄_k) / |C_k|,  (7)

NCut(C_1, …, C_K) = Σ_{k=1}^{K} cut(C_k, C̄_k) / vol(C_k),  (8)

where C_k denotes the set of nodes in community k, K represents the number of communities, C̄_k refers to the complement of C_k, cut(C_k, C̄_k) = Σ_{i∈C_k, j∈C̄_k} wp_ij, |C_k| is the number of nodes in C_k, and vol(C_k) = Σ_{i∈C_k} d_i is the sum of the weighted degrees of the nodes in C_k.

A large number of nodes in a community does not imply that the weight within the community is high. By comparison, NCut is more consistent with the clustering strategy of spectral clustering, so we choose NCut as the objective function of the proposed algorithm. Combined with equation (6), minimizing the objective function can be relaxed to

min_F Tr(F^T · D^{−1/2} · L_W · D^{−1/2} · F)  subject to  F^T · F = I.  (9)

F is a matrix composed of the indicator vectors f, and I is the identity matrix. F can be obtained by solving for the first K smallest eigenvectors of D^{−1/2} · L_W · D^{−1/2}. However, some information is lost in the dimensionality reduction, so F cannot fully represent the attributes of the nodes. Therefore, applying a traditional clustering method such as K-means to the rows of F divides the network into K communities more accurately.

3.2.3. The Main Steps of the Algorithm

The main steps of the improved spectral clustering algorithm are given in Algorithm 1.

Input network N, adjacency matrix W, community number K, time scale t, and a set of weights ws
 Output K communities
(1)Compute the 1st transition matrix Pr according to equation (1)
(2)Compute the mean probability matrix PM according to equation (4)
(3)Construct the similarity matrix WP according to equation (6)
(4)Construct the unnormalised Laplacian matrix LW according to the property 1 of WP in Section 3.2.1
(5)Construct the normalized Laplacian matrix Ln with Ln = D−1/2·LW·D−1/2
(6)Compute the first K eigenvectors of Ln, referred to as U
(7)Consider the rows of U as nodes, and use K-means to cluster them into K communities
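The seven steps above can be sketched end to end on a toy graph of two triangles joined by a single edge. Everything below is an illustrative reimplementation under stated assumptions (uniform per-step weights, a minimal NumPy k-means in place of a library call), not the authors' code:

```python
import numpy as np

def iscp_sketch(W, K, t=5, iters=50):
    n = len(W)
    Pr = W / W.sum(axis=1, keepdims=True)      # step 1: 1-step transition matrix
    powers, M = [], np.eye(n)
    for _ in range(t):
        M = M @ Pr
        powers.append(M.copy())
    # Step 2: mean probability matrix (uniform weights, so P_i is the
    # mean of the first i transition-matrix powers)
    PM = sum(sum(powers[:i]) / i for i in range(1, t + 1)) / t
    WP = PM.copy()
    np.fill_diagonal(WP, 0.0)                  # step 3: similarity matrix
    d = WP.sum(axis=1)
    LW = np.diag(d) - WP                       # step 4: unnormalised Laplacian
    Dh = np.diag(1.0 / np.sqrt(d))
    Ln = Dh @ LW @ Dh                          # step 5: normalised Laplacian
    Ln = (Ln + Ln.T) / 2                       # symmetrise before eigh
    _, vecs = np.linalg.eigh(Ln)
    U = vecs[:, :K]                            # step 6: first K eigenvectors
    # Step 7: minimal k-means with deterministic farthest-point initialisation
    centers = U[[0]]
    for _ in range(K - 1):
        dist = ((U[:, None] - centers) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, U[np.argmax(dist)]])
    for _ in range(iters):
        labels = ((U[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([U[labels == k].mean(axis=0) for k in range(K)])
    return labels

# Two triangles {0, 1, 2} and {3, 4, 5} bridged by the edge (2, 3)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
labels = iscp_sketch(W, K=2)
```

On this graph the spectral embedding separates the two triangles, and k-means on the rows of U recovers them as the two communities.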

4. Experiments and Analyses

The experimental data includes artificially generated networks and real networks. On the one hand, we use the LFR benchmark network [15] to generate the networks and evaluate the quality of community detection by normalized mutual information (NMI) [16]. On the other hand, we adopt several real networks and take the modularity (Q) [17] as the evaluation index.

In order to show the performance of the improved spectral clustering algorithm (ISCP), ISCP is compared with SC [2], WT [18], FG [19], FluidC [20], and SCRW [11]. The experimental environment is an Intel i7-4710MQ CPU at 2.5 GHz with 8 GB RAM. The software platform is PyCharm 2018.1.2 (Community Edition) on Windows 10 x64.

4.1. LFR Benchmark Networks

The LFR benchmark networks are computer-generated networks whose features can be tuned through several parameters. The experiments mainly vary the mixing parameter μ (the average fraction of a node's edges that connect to other communities, 0 ≤ μ ≤ 1) and the network size N to evaluate performance. For consistency, the detailed descriptions and values of the other parameters are shown in Table 1.

Figure 1 shows the performance of the six algorithms as μ varies. From Figure 1, we can see that the NMI curve of ISCP is smoother than those of the other algorithms, and its NMI is significantly higher than that of the other five. As noted in [16], the larger the NMI, the better the quality of community detection. Overall, the clustering effect of ISCP is significantly better than that of the other five algorithms, and ISCP is more stable with faster convergence.

Figure 2 demonstrates the performance of the six algorithms on different network sizes N. As seen in Figure 2, the NMI of ISCP is higher than that of the other five algorithms, and it increases with network size. When the network size reaches 5000 or more, its NMI stabilizes at around 0.9. Therefore, whether the network size is on the order of 1000 or 10,000, the clustering performance of ISCP is better than that of the other five algorithms.

4.2. Real-World Networks

The real-world networks have different topologies from the benchmark networks. To further evaluate the algorithms, 8 real-world networks are used in the experiments. Some of them require preprocessing, such as eliminating self-loops and ensuring the network is connected. Detailed information on these networks is shown in Table 2.

The experiments use modularity Q to evaluate the clustering performance of the six algorithms. Q ranges from −0.5 to 1; the larger Q is, the better the community detection performance. In practice, Q typically falls between about 0.3 and 0.7 [17].
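For reference, Q for a hard partition can be computed directly from the adjacency matrix. This sketch follows the standard Newman definition Q = (1/2m) · Σ_ij (w_ij − d_i·d_j/2m) · δ(c_i, c_j); the toy graph is our own example, not one of the 8 networks in Table 2:

```python
import numpy as np

def modularity(W, labels):
    """Newman modularity Q of a hard partition of an undirected network."""
    m2 = W.sum()                          # 2m: every edge counted twice
    d = W.sum(axis=1)                     # weighted degrees
    Q = 0.0
    n = len(W)
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                Q += W[i, j] - d[i] * d[j] / m2
    return Q / m2

# Toy graph: two triangles joined by one bridge edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
Q = modularity(W, [0, 0, 0, 1, 1, 1])     # the two-triangle partition
```

For this partition Q = 5/14 ≈ 0.357, squarely in the 0.3–0.7 range typical of real community structure; putting all nodes in one community gives Q = 0.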

Figure 3 shows the performance of the six algorithms on the real-world networks. As shown in Figure 3, the Q of ISCP is almost always above 0.3 and larger than that of the other algorithms. Although ISCP is not the best algorithm on network 1 and network 7, its performance is very close to the best. Generally speaking, ISCP has excellent clustering performance and clusters real-world networks more accurately.

5. Conclusions

Spectral clustering plays an important role in the field of community detection. It is an excellent community detection algorithm, but the traditional similarity graphs contain a large amount of incorrect information about the community structure, so community detection performance suffers. Hence, this paper presents the probability matrix and proposes ISCP, an improved spectral clustering community detection algorithm based on it. Extensive experiments on benchmark networks and real-world networks show that ISCP outperforms most traditional community detection algorithms and clusters complex networks more accurately.

However, ISCP is costly in time and space. Given a network N with n nodes and time scale t, ISCP must compute up to the t-th power of the transition matrix to construct the similarity matrix. Even with fast matrix exponentiation, the time complexity reaches O(n³ · log t). As the network grows larger, computing the similarity matrix takes more time and space. Moreover, ISCP applies only to nonoverlapping community structures. The next steps are therefore to reduce the computational complexity of the algorithm and to extend it to overlapping networks.

Data Availability

The data cannot be released for the time being. When the relevant research is finished, we will release detailed research results.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.