Letter

Multi-objective optimization for community detection in multilayer networks


Published 6 September 2021 Copyright © 2021 EPLA
Citation: Shihong Jiang et al 2021 EPL 135 18001, DOI 10.1209/0295-5075/135/18001

0295-5075/135/1/18001

Abstract

Community detection in multilayer networks plays a key role in revealing the multiple aspects of information spreading and in comprehending the relationships and interactions within and between each layer. However, most existing algorithms are prone to local optimality, and they are also difficult to extend to high-dimensional networks. To address these challenges, we propose here a multi-objective algorithm for community detection that is based on the genetic algorithm. In particular, the modularity is introduced to optimize each network layer iteratively, and the local search is combined with genetic operations to overcome local optimality. Comparative benchmarks with other algorithms on artificial and real-world networks show that the proposed algorithm performs better, especially on high-dimensional networks.


Introduction

Complex networks offer a concise and valid way to represent the structure among interacting units of a real complex system [1–4]. Community structure is a classic feature of networks [5], and it effectively reflects typical characteristics such as the regularity of network structures [6–8]. Although community detection algorithms for single-layer networks have achieved a great deal, objects in the real world tend to exhibit multi-dimensional attributes. For example, a person may hold several social accounts and interact with others through various social media platforms [9]. Each layer of a network depicts one kind of connection among objects and reflects their properties in one dimension. Although each layer of a multilayer network has its own community structure, no single layer can represent the composite structure of the whole network. How to use the complementary information provided by different layers to obtain the community structure of a multilayer network is one of the focuses of current multilayer network analysis, and it has practical significance for a more realistic and accurate understanding of the multi-dimensional structure of complex systems [10–14].

In recent years, driven by practical applications, research on community detection algorithms for multilayer networks has attracted increasing attention [15]. The most intuitive methods are extensions of single-layer methods. Ma et al. divide these approaches into two categories, i.e., single-analysis–based methods and multi-analysis–based methods [16]. The former collapse the multilayer network into a single-layer network and then apply single-layer algorithms to find the community division. The latter apply a single-layer method to every layer and derive the final communities through consensus clustering. However, both share one shortcoming: they fail to preserve the information of the network and ignore the interactions between different layers. To overcome these problems, multilayer network community detection is transformed into a multi-objective optimization problem.

In the multi-objective optimization method, multiple objectives are optimized simultaneously to preserve the integrity of network information [12]. However, how to select from the Pareto set the optimal solution that suits a given network structure remains an open problem. To address this, this paper proposes a novel GA-based multi-objective algorithm built on the NSGA-II multi-objective optimization framework, denoted as NSGAMOF, for detecting the community structure of multilayer networks. The proposed method balances all layers to preserve the information of the multilayer network, and several selection strategies are proposed to adapt to various networks.

Related work

This section introduces formulations and algorithms of multilayer community detection.

Formulations of multilayer networks

A multilayer network is abstracted as $G = (\{G_1,\ldots,G_l\}, R)$, where $G_l=(V_l,E_l)$ stands for the network at the l-th layer, $V_l$ represents its nodes and $E_l$ its intra-layer links. $R = \{E_{ij} \subset V_i \times V_j,\ i, j \in \{1,\ldots,l\},\ i \ne j \}$ indicates the connections between nodes of layers $G_i$ and $G_j$. The elements of R represent inter-layer (cross-layer) connections. That is, l subnetworks and the connections among them together make up a multilayer network.

There are few indexes that directly assess the quality of a compound community in multilayer networks. At present, the composite modularity $(Q_c^{\prime})$ [17], the redundancy index ($R_c$) [18], the normalized mutual information (NMI) [19] and the adjusted Rand index (ARI) [20] are commonly used as evaluation indicators.

As defined in eq. (1), the composite modularity $Q_c^{\prime}$ is used to estimate community quality. The greater $Q_c^{\prime}$ is, the better the quality of the compound community. M denotes the number of links and n the number of dimensions (layers) of the network. A denotes the corresponding adjacency matrix, and d refers to the degree of a node. $X_i$ indicates the community to which node i belongs: $\delta(X_i, X_j)=1$ when $X_i=X_j$, and 0 otherwise:

Equation (1)

The redundancy index $R_c$ [18] denotes the ratio of redundant connections in a multilayer network. The larger $R_c$ is, the better the quality of the compound community. Intuitively, a compound community should have links across multiple layers. $\|p\|$ represents the number of communities. $S_i^{\prime}$ denotes the set of node pairs $\{c_1,c_2\}$ that are connected in at least one layer of G. If $\{c_1,c_2\} \in S_i^{\prime}$, then $\beta(c_1,c_2,E_l)=1$; otherwise $\beta(c_1,c_2,E_l)=0$. The redundancy index is formulated in the following equation:

Equation (2)

The normalized mutual information (NMI) assesses the similarity between the detected partition and the ground-truth one [19]. $F^{\prime}$ denotes the confusion matrix. $F_x^{\prime}$ ($F_y^{\prime}$) denotes the sum of the elements in the x-th row (y-th column) of $F^{\prime}$. $r_{c_1}$ ($r_{c_2}$) signifies the number of communities in partition $c_1$ ($c_2$). NMI ranges over [0, 1]; the greater the NMI, the higher the similarity between the detected and ground-truth partitions. Suppose that $c_1$ and $c_2$ are two partitions of a network; then $\textit{NMI}(c_1,c_2)$ is calculated as follows:

Equation (3)
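As an illustration of how such a score can be computed, the sketch below assumes the widely used confusion-matrix form of NMI, which matches the symbols defined above. The function name `nmi`, the total node count `n` and the choice of natural logarithms are assumptions made here, not details given in the text.

```python
import math
from collections import Counter

def nmi(c1, c2):
    """NMI of two partitions given as equal-length label lists (assumed standard form)."""
    n = len(c1)
    # Confusion matrix F': F[(x, y)] = number of nodes in community x of c1 and y of c2.
    F = Counter(zip(c1, c2))
    row = Counter(c1)  # F'_x: row sums of the confusion matrix
    col = Counter(c2)  # F'_y: column sums of the confusion matrix
    num = -2.0 * sum(f * math.log(f * n / (row[x] * col[y]))
                     for (x, y), f in F.items())
    den = (sum(r * math.log(r / n) for r in row.values())
           + sum(c * math.log(c / n) for c in col.values()))
    # If both partitions consist of one community, treat them as identical.
    return 1.0 if den == 0 else num / den
```

Two partitions that differ only in how the communities are named score 1 under this definition, while two statistically independent partitions score 0.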

The adjusted Rand index $(\textit{ARI})$ measures the similarity between the real communities and the detected ones. Let $c_1$ be the real partition and $c_2$ the obtained one. e is the number of node pairs placed in the same community in both $c_1$ and $c_2$, and h the number of pairs placed in different communities in both. f is the number of pairs in the same community in $c_1$ but in different communities in $c_2$, and g the number of pairs in the same community in $c_2$ but in different communities in $c_1$. The more similar the real partition and the obtained one are, the greater the value of $\textit{ARI}$. $\textit{ARI}$ is defined as follows [20]:

Equation (4)
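The pair counts e, f, g and h can be turned into a score directly. The sketch below assumes the standard pair-counting form of the adjusted Rand index, $2(eh - fg)/[(e+f)(f+h) + (e+g)(g+h)]$, which may be written differently from eq. (4) but is algebraically the usual ARI. The function names are hypothetical.

```python
from itertools import combinations

def pair_counts(c1, c2):
    """Count node pairs by agreement between partitions c1 (real) and c2 (obtained)."""
    e = f = g = h = 0
    for i, j in combinations(range(len(c1)), 2):
        same1 = c1[i] == c1[j]
        same2 = c2[i] == c2[j]
        if same1 and same2:
            e += 1   # together in both partitions
        elif same1:
            f += 1   # together in c1 only
        elif same2:
            g += 1   # together in c2 only
        else:
            h += 1   # separated in both partitions
    return e, f, g, h

def ari(c1, c2):
    """Adjusted Rand index in pair-counting form (assumed standard definition)."""
    e, f, g, h = pair_counts(c1, c2)
    den = (e + f) * (f + h) + (e + g) * (g + h)
    # den is 0 only for degenerate cases such as two single-community partitions.
    return 1.0 if den == 0 else 2.0 * (e * h - f * g) / den
```

The quadratic loop over node pairs is kept for readability; a contingency-table implementation would be preferable for large networks.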

In this paper, modularity is used as the optimization function, and NMI, $R_c$ and ARI are used as evaluation indexes due to their capability of identifying high-quality solutions in multilayer networks [11,21–24].
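To make the optimization function concrete, the following sketch computes the standard Newman-Girvan modularity of a single undirected, unweighted layer and stacks one value per layer into an objective vector. Treating each layer's modularity as a separate objective is an assumption consistent with the description of the method; the function names and the edge-list input format are hypothetical.

```python
def layer_modularity(edges, labels):
    """Newman-Girvan modularity of one layer.

    edges: list of (u, v) pairs; labels: dict mapping node -> community id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    intra = {}  # number of edges falling inside each community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    total = {}  # summed degree of the nodes in each community
    for node, d in degree.items():
        total[labels[node]] = total.get(labels[node], 0) + d
    return sum(intra.get(c, 0) / m - (total[c] / (2 * m)) ** 2 for c in total)

def objective_vector(layers, labels):
    """One modularity value per layer for a shared community assignment."""
    return [layer_modularity(edges, labels) for edges in layers]
```

For two triangles joined by a single edge and split into the two obvious communities, this gives a modularity of 5/14.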

Multilayer networks community detection

Existing community detection algorithms for multiplex networks fall into four categories: matrix-decomposition–based, spectral-based, information-diffusion–based, and modularity-based methods.

Matrix-decomposition–based methods extract the community division by decomposing a matrix, such as NMF [16,22] and SNMF [11,25]. Spectral-based methods compute the community division by applying eigendecomposition to Laplacian matrices, such as MIMOSA [26] and SC-ML [27]. Algorithms of this kind perform well because they capture global information across different layers. Information-diffusion–based methods integrate the layers of a multilayer network through diffusion on networks [28]. For example, similarity network fusion (SNF) computes a fused matrix of all layers through a parallel interchanging diffusion process, and then finds the community division by applying spectral clustering to the fused matrix [21]. The generalized Louvain (GL) method, one of the most efficient modularity-based algorithms, achieves its efficiency by optimizing a generalized modularity function [23]. However, GL has great difficulty in mining the consensus community division of all layers.

The methods mentioned above can handle almost all multilayer networks accurately and efficiently, but some drawbacks are unavoidable [11]. For example, spectral-clustering–based methods such as SC-ML are good at extracting small, tight communities, which may cause some crucial information to be ignored. Many modularity-based algorithms are extended from single-layer methods and do not address the information across different layers. To overcome these problems, a novel multi-objective optimization algorithm is proposed to balance the community structure of every layer.

Formulation of NSGAMOF

This section explains the NSGAMOF optimization algorithm in detail, covering the code scheme, genetic operators, local search and the diverse optimal selection policy, to show how the algorithm preserves the information of networks and selects an adapted solution. Figure 1 shows the framework of the NSGAMOF algorithm.

Fig. 1: The framework of NSGAMOF, which consists of (a) hill-climbing method, (b) the main body of multi-objective optimization and (c) the optimal solution selection strategy.

Code scheme

An appropriate code scheme for solutions can effectively reduce computation and speed up convergence. The label-based and locus-based representations are the main encoding methods for community detection. However, the label-based code scheme is redundant: if there are t labels in the pattern, then t! different chromosomes may map to the same division [29]. To make the most of the information contained in the pattern, this paper employs the locus-based adjacency representation. Suppose a chromosome in the population is $S=\{s_1, s_2,\ldots,s_N\}$. Its length is N, the number of nodes, and each gene $s_i$ can take any integer value between 1 and N, namely $1 \le s_i \le N$. The i-th gene may take the value j only if nodes i and j are linked in at least one layer of the network.
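A locus-based chromosome becomes a community division by reading each gene as a link between node i and node $s_i$ and taking the connected components of those links. The sketch below is one minimal way to do this with a union-find structure. It uses 0-based node indices rather than the 1-based indices of the text, and the function name `decode` is hypothetical.

```python
def decode(chromosome):
    """Map a locus-based chromosome to community labels.

    chromosome[i] = j means nodes i and j belong to the same community (0-based).
    """
    parent = list(range(len(chromosome)))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(chromosome):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components

    # Relabel component roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(chromosome))]
```

For example, `decode([1, 2, 0, 4, 3, 4])` groups nodes 0 to 2 into one community and nodes 3 to 5 into another. Because the number of components is implied by the genes, the number of communities does not need to be fixed in advance.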

Genetic operators

Uniform crossover is used as the crossover operator in this paper. Uniform crossover belongs to the multi-point crossover category and is an efficient operator in evolutionary algorithms. First, a binary mask whose length equals the number of nodes is generated at random. According to the mask, the corresponding genes are selected from the two parent chromosomes to form the offspring chromosome. Specifically, if the mask bit is 0, the operator takes the gene from the first parent; otherwise, it takes the gene from the second parent.
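The mask-based selection described above can be sketched in a few lines. The function name and the optional `rng` argument, which accepts a seeded `random.Random` instance for reproducibility, are assumptions made for this illustration.

```python
import random

def uniform_crossover(parent1, parent2, rng=random):
    """Build one offspring by picking each gene from parent1 (mask 0) or parent2 (mask 1)."""
    mask = [rng.randint(0, 1) for _ in parent1]
    return [a if bit == 0 else b for a, b, bit in zip(parent1, parent2, mask)]
```

With the locus-based representation, every gene of the offspring is copied from a parent at the same position, so if both parents only link nodes to their neighbours, the offspring does as well.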

The mutation operator uses the neighbourhood information of the nodes across layers and mutates genes at random with a given probability. The i-th gene of a chromosome is selected at random with a predefined probability, and it then mutates into one of the neighbours j of node i.
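A minimal sketch of this neighbour-based mutation follows. It assumes that `neighbours[i]` lists the nodes adjacent to node i in at least one layer and that `rate` is the per-gene mutation probability; both names, and the function name itself, are hypothetical.

```python
import random

def mutate(chromosome, neighbours, rate, rng=random):
    """Replace each gene, with probability `rate`, by a random neighbour of its node."""
    child = list(chromosome)
    for i in range(len(child)):
        # Isolated nodes have no admissible replacement and are left unchanged.
        if neighbours[i] and rng.random() < rate:
            child[i] = rng.choice(neighbours[i])
    return child
```

Restricting replacements to actual neighbours keeps every mutated chromosome a valid locus-based encoding.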

Local search operation

The NSGAMOF algorithm includes a local search operation, namely the hill-climbing (HC) method. The first step is to define the neighbours of a chromosome. Given a chromosome $S_k = \{ S_k^1, S_k^2, \ldots, S_k^n \}$, a gene $S_k^i$ is selected from $S_k$ at random. Then $S_k^i$ is replaced by another neighbour node $S_k^j$ of location i, with $S_k^j \neq S_k^i$, and the result is denoted by $S_k^{\prime}$. The newly generated chromosome $S_k^{\prime}$ is a neighbour of chromosome $S_k$. In the local search, a chromosome is chosen at random to be refined and all its possible neighbour chromosomes are identified. If a newly generated chromosome is better than the original one, it replaces the original.
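One hill-climbing step of this kind might look as follows. The text does not say how two chromosomes are compared in the multi-objective setting, so this sketch assumes a hypothetical scalar `score` callable to be maximised; `neighbours[i]` is again assumed to list the nodes adjacent to node i.

```python
import random

def hill_climb(chromosome, neighbours, score, rng=random):
    """Try every neighbouring chromosome at one random position and keep the best."""
    best, best_score = list(chromosome), score(chromosome)
    i = rng.randrange(len(chromosome))
    for j in neighbours[i]:
        if j == chromosome[i]:
            continue  # a neighbouring chromosome must differ at position i
        candidate = list(chromosome)
        candidate[i] = j
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best
```

The returned chromosome differs from the input in at most one gene and never has a lower score, which is the defining property of a hill-climbing move.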

Optimal selection strategy

In a multi-objective optimization problem (MOOP), a set of optimal solutions, the Pareto set (PS), is obtained, which represents the optimal trade-offs among the optimization objectives. To obtain the final solution from the PS, we use diverse optimal solution selection policies. There are three types of multi-criterion decision making: 1) a posteriori-based, 2) a priori-based, and 3) interactive-based [30].
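For readers less familiar with the notion, a Pareto set can be extracted from a list of objective vectors with a simple dominance test. The sketch assumes that all objectives are maximised, as is the case for modularity; the function names are hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_set(points):
    """Keep the objective vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the two-layer vectors (0.4, 0.1), (0.3, 0.3), (0.1, 0.4) and (0.2, 0.2), only the last is dominated, by (0.3, 0.3), and is therefore excluded.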

The a priori-based policy selects the solution in the Pareto set that maximizes the objective function (i.e., the solution with the maximum modularity). This policy, which uses prior knowledge, is called NSGAMOF-prik.

Unlike the a priori-based policy, the a posteriori-based strategy takes the overall information of the network into consideration. It calculates the average modularity over all layers, and the chromosome with the highest average value is taken as the optimal solution. This method is called NSGAMOF-postk.

The last one is the interactive-based strategy, which uses k-means clustering to group the data obtained by the different models and so enhance the quality of community detection. This strategy, which selects the fittest solution by clustering over the Pareto solutions, is called NSGAMOF-clu. Figure 2 shows an example of NSGAMOF-clu with a three-layer network and six nodes per layer.

Fig. 2: An example of NSGAMOF-clu algorithm with three-layer networks and six nodes in each layer. The community structure in each layer is searched in Step 1 and the compound structure is obtained through k-means clustering in Step 2.

The selection strategies described above adapt to different network structures, which helps ensure high performance. For example, the prik-based method suits networks whose information is unevenly distributed across layers (i.e., most of the information exists in one layer). For networks with uniformly distributed information, however, the prik-based approach may lose some information, and the postk-based strategy can address the problem. In general, the proposed strategies can improve the accuracy for multilayer networks with various structures.
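The difference between the prik and postk policies can be illustrated on a Pareto set of per-layer modularity vectors. The sketch reads "maximum modularity" in the a priori policy as the largest single-layer value, which is an interpretation rather than something the text states explicitly; the function names are hypothetical, and the clustering-based NSGAMOF-clu policy is not covered.

```python
def select_prik(pareto):
    """A priori policy (as interpreted here): the vector holding the largest single-layer modularity."""
    return max(pareto, key=lambda q: max(q))

def select_postk(pareto):
    """A posteriori policy: the vector with the highest average modularity over layers."""
    return max(pareto, key=lambda q: sum(q) / len(q))

# Hypothetical Pareto set for a two-layer network.
pareto = [(0.50, 0.05), (0.30, 0.30), (0.10, 0.45)]
best_prik = select_prik(pareto)    # favours the layer-dominated solution
best_postk = select_postk(pareto)  # favours the balanced solution
```

In this example the a priori policy picks the solution that is strong on one layer, while the a posteriori policy picks the balanced one, mirroring the uneven versus uniform information cases discussed above. In practice each vector would be stored together with its chromosome.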

Experiments

This section compares NSGAMOF with other advanced methods on both real and synthetic multilayer datasets. Each reported result is the average over 100 runs of the method. The population size and the number of iterations are set to 200 and 100, respectively. The crossover and mutation probabilities are set to 0.8 and 0.2, respectively.

Datasets

The synthetic m-LFR128 networks have 128 nodes in each layer [31], and their actual partition structure is known. Changing the parameters controls the total number of edges among communities and the difference of node degrees among layers. Networks with 2, 3 and 4 layers are generated, and these networks have different topologies. The mixing parameter u is the fraction of a node's connections that link to nodes outside its own community; the quality of community detection usually deteriorates as u gets larger. The degree of each node in different layers is governed by Dc, the degree change chance. The greater Dc is, the more the nodes may differ across layers. Moreover, eight real-world network datasets are used in this paper, i.e., Kapferer Tailor Shop (KTS), CS-Aarhus Social (CAS) [32], Bibliographic Data (CoRA) [33], Mobile Phone Data (MPD) [34], Worm Brain Networks (WBN), World Trade Networks (WTN) [35] and Social Network Data (SND), which includes two different ground-truth divisions, i.e., SND(o) and SND(s) [36] (see table 1).

Table 1: The structural information of the real-world networks. The Layers and Nodes denote the number of layers and nodes of multilayer networks, respectively.

| | KTS | CAS | CoRA | MPD | WBN | WTN | SND(o) | SND(s) |
|---|---|---|---|---|---|---|---|---|
| Layers | 4 | 5 | 2 | 3 | 10 | 14 | 3 | 3 |
| Nodes | 39 | 61 | 1662 | 87 | 279 | 183 | 71 | 71 |

Experiments on synthetic networks

Figure 3 compares the three strategies of NSGAMOF with the MLMaOP algorithm (i.e., proj, cspa, mf) [14] in terms of NMI under different parameters (i.e., d, Dc and u) on 12 network structures from the mLFR dataset. The results show that every strategy of the NSGAMOF algorithm performs better than all strategies of the MLMaOP algorithm. Moreover, increasing the number of layers scarcely influences the performance of NSGAMOF. Although the parameter Dc affects the network architecture, NSGAMOF can still find the optimal division under diverse network architectures. We conclude that the NSGAMOF algorithms (i.e., NSGAMOF-prik, NSGAMOF-clu, NSGAMOF-postk) perform better than the MLMaOP algorithms (i.e., proj, cspa, mf) across selection policies, network architectures and numbers of layers.

Fig. 3: NMI results as the network topology changes (u and Dc codetermine the network structure) and the number of layers increases for mLFR networks. The results indicate that the network structure and the total number of layers only marginally affect NSGAMOF (i.e., NSGAMOF-prik, NSGAMOF-clu, NSGAMOF-postk). By comparison, the performance of the proposed NSGAMOF is superior to that of MLMaOP (i.e., proj, cspa, mf).

Experiments on real-world networks

Comparative results of NSGAMOF-prik, NSGAMOF-clu, NSGAMOF-postk, BGLL [37] and the MOEA-MultiNet algorithm [11] on the Kapferer Tailor Shop and CS-Aarhus Social multilayer networks are shown in table 2. More specifically, for the one-layer strategies, the rows L1 to L4 (and L5 for CS-Aarhus) report the simplified composite modularity $Q_c^{\prime}$ and redundancy $R_c$ calculated from the communities returned by the BGLL algorithm on each individual layer of the network. For the multilayer strategies, MOEA-MultiNet and NSGAMOF, the corresponding simplified composite modularity $Q_c^{\prime}$ and redundancy $R_c$ are reported.

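Table 2: The $Q_c^{\prime}$ and $R_c$ metric comparative results.

| Dataset | Strategy | Algorithm | $Q_c^{\prime}$ | $R_c$ |
|---|---|---|---|---|
| KAPFERER TAILOR SHOP NETWORK | One-layer | BGLL/L1 | 0.2179 | 0.3964 |
| KAPFERER TAILOR SHOP NETWORK | One-layer | BGLL/L2 | 0.2006 | 0.4717 |
| KAPFERER TAILOR SHOP NETWORK | One-layer | BGLL/L3 | 0.1380 | 0.2657 |
| KAPFERER TAILOR SHOP NETWORK | One-layer | BGLL/L4 | 0.0932 | 0.4094 |
| KAPFERER TAILOR SHOP NETWORK | Multilayer | MOEA-MultiNet | 0.2094 | 0.4735 |
| KAPFERER TAILOR SHOP NETWORK | Multilayer | NSGAMOF-prik | 0.4343 | 0.3705 |
| KAPFERER TAILOR SHOP NETWORK | Multilayer | NSGAMOF-clu | 0.4698 | 0.3511 |
| KAPFERER TAILOR SHOP NETWORK | Multilayer | NSGAMOF-postk | 0.4810 | 0.3134 |
| CS AARHUS NETWORK | One-layer | BGLL/L1 | 0.4685 | 0.2852 |
| CS AARHUS NETWORK | One-layer | BGLL/L2 | 0.1672 | 0.0472 |
| CS AARHUS NETWORK | One-layer | BGLL/L3 | 0.0832 | 0.1205 |
| CS AARHUS NETWORK | One-layer | BGLL/L4 | 0.2893 | 0.1611 |
| CS AARHUS NETWORK | One-layer | BGLL/L5 | 0.4115 | 0.2715 |
| CS AARHUS NETWORK | Multilayer | MOEA-MultiNet | 0.4010 | 0.3186 |
| CS AARHUS NETWORK | Multilayer | NSGAMOF-prik | 0.2316 | 0.3703 |
| CS AARHUS NETWORK | Multilayer | NSGAMOF-clu | 0.2315 | 0.3617 |
| CS AARHUS NETWORK | Multilayer | NSGAMOF-postk | 0.2287 | 0.3847 |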

The results show that the multilayer strategy is superior to the one-layer strategy for multilayer network community detection. The proposed NSGAMOF-postk algorithm clearly outperforms the other algorithms in terms of the simplified composite modularity on the Kapferer Tailor Shop network ($Q_c^{\prime} = 0.4810$), while on the CS-Aarhus network the NSGAMOF strategies obtain the highest redundancy $R_c$. In general, we conclude that the compound community structure acquired by the proposed algorithms (NSGAMOF-prik, NSGAMOF-clu, NSGAMOF-postk) performs better than those of the single-layer BGLL [37] and the MOEA-MultiNet algorithm [11].

Tables 3 and 4 further compare the proposed strategies of NSGAMOF with other algorithms (i.e., CSNMF, CPNMF, CSNMTF, CSsNMTF [11], SNF [21], SC-ML [27], CGC [22], GL [23] and Infomap [38]) on five different real multilayer datasets in terms of NMI and ARI, respectively. On most real-world datasets, at least one of the proposed strategies is superior to the other algorithms, especially in terms of NMI. The ARI values of the proposed algorithm also show clear advantages over the other algorithms. In addition, to validate the performance of the proposed algorithm more thoroughly, the results on the HBN network, which contains 13281 nodes and 8 layers, are shown in table 5. The results indicate that the proposed algorithm achieves an acceptable result.

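Table 3: The NMI comparison results on real-world multilayer networks datasets. prik, clu and postk denote NSGAMOF-prik, NSGAMOF-clu and NSGAMOF-postk, respectively. The dash means that MIMOSA does not work on such dataset.

| | CoRA | MPD | WBN | SND(o) | SND(s) | WTN |
|---|---|---|---|---|---|---|
| prik | 0.716 | 0.554 | 0.372 | 0.578 | 0.221 | 0.266 |
| clu | 0.834 | 0.464 | 0.387 | 0.553 | 0.163 | 0.266 |
| postk | 0.575 | 0.468 | 0.363 | 0.691 | 0.058 | 0.291 |
| CSNMF | 0.514 | 0.504 | 0.463 | 0.681 | 0.053 | 0.322 |
| CPNMF | 0.480 | 0.451 | 0.432 | 0.685 | 0.053 | 0.172 |
| CSNMTF | 0.346 | 0.458 | 0.404 | 0.773 | 0.276 | 0.154 |
| CSsNMTF | 0.39 | 0.521 | 0.424 | 0.678 | 0.034 | 0.155 |
| SNF | 0.449 | 0.395 | 0.425 | 0.689 | 0.057 | 0.073 |
| SC-ML | 0.48 | 0.495 | 0.079 | 0.681 | 0.030 | 0.226 |
| CGC | 0.389 | 0.457 | 0.370 | 0.673 | 0.078 | 0.072 |
| GL | 0.418 | 0.467 | 0.398 | 0.618 | 0.097 | 0.426 |
| Infomap | 0.373 | 0.410 | 0.355 | 0.001 | 0.001 | 0.273 |
| S2-jNMF | 0.796 | 0.516 | 0.072 | 0.582 | 0.037 | 0.157 |
| Comclus | 0.471 | 0.421 | 0.362 | 0.555 | 0.073 | 0.374 |
| GMC | 0.519 | 0.451 | 0.242 | 0.597 | 0.051 | 0.204 |

MIMOSA (the dataset carrying the dash is not identifiable here, so its five values are listed in their original order): 0.011, 0.096, 0.020, 0.132, 0.046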

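Table 4: ARI comparison results. prik, clu and postk denote NSGAMOF-prik, NSGAMOF-clu and NSGAMOF-postk, respectively. The dash means that MIMOSA does not work on such dataset.

| | CoRA | MPD | WBN | SND(o) | SND(s) | WTN |
|---|---|---|---|---|---|---|
| prik | 0.777 | 0.409 | 0.214 | 0.462 | 0.257 | 0.196 |
| clu | 0.879 | 0.373 | 0.246 | 0.435 | 0.177 | 0.174 |
| postk | 0.645 | 0.35 | 0.313 | 0.544 | 0.058 | 0.203 |
| CSNMF | 0.491 | 0.394 | 0.291 | 0.493 | 0.059 | 0.160 |
| CPNMF | 0.47 | 0.368 | 0.233 | 0.503 | 0.059 | 0.094 |
| CSNMTF | 0.279 | 0.346 | 0.237 | 0.811 | 0.234 | 0.035 |
| CSsNMTF | 0.288 | 0.422 | 0.225 | 0.484 | 0.031 | 0.088 |
| SNF | 0.47 | 0.28 | 0.211 | 0.515 | 0.058 | 0.005 |
| SC-ML | 0.485 | 0.379 | 0.001 | 0.493 | 0.021 | 0.133 |
| CGC | 0.296 | 0.357 | 0.211 | 0.472 | 0.092 | 0.002 |
| GL | 0.334 | 0.372 | 0.185 | 0.460 | 0.089 | 0.068 |
| Infomap | 0.016 | 0.115 | 0.200 | 0.012 | 0.001 | 0.093 |
| S2-jNMF | 0.813 | 0.396 | 0.076 | 0.452 | 0.023 | 0.069 |
| Comclus | 0.447 | 0.365 | 0.251 | 0.481 | 0.086 | 0.269 |
| GMC | 0.426 | 0.248 | 0.012 | 0.428 | 0.034 | 0.015 |

MIMOSA (the dataset carrying the dash is not identifiable here, so its five values are listed in their original order): 0.001, 0.010, 0.005, 0.086, 0.001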

Table 5: The results on large-scale datasets. The Rc and $Q_{c}^{\prime}$ are chosen as the metrics to evaluate the performance of the algorithm when running on the large-scale datasets.

| | prik | clu | postk | Comclus |
|---|---|---|---|---|
| Qm | 0.124 | 0.128 | 0.087 | 0.1775 |
| Rc | 0.057 | 0.065 | 0.053 | 0.001 |

Analysis of selection strategies

Although the ground truth represents a kind of consensus division, the information and importance of each layer differ. To account for the differing importance of each layer, three strategies are proposed to better adapt to different network structures. Here, two datasets (i.e., MPD and SND(o)) are used to verify the validity of the proposed strategies. Figure 4 plots the NMI and ARI of each layer.

Fig. 4: The NMI and ARI of each layer. There exist relative differences in different layers for a multilayer network. Different strategies can adapt to different multilayer networks to improve the accuracy of algorithms.

According to the results in tables 3 and 4, the a posteriori-based strategy (postk) performs well on the SND(o) dataset, and the a priori-based strategy (prik) acquires a better solution on the MPD dataset. For SND(o), the best solution is extracted by choosing the maximum average over all layers, since complementary information can improve the accuracy; for MPD, the information of other layers reduces the accuracy, so the prik strategy gets the best solution. The experimental results demonstrate that the proposed strategies can improve performance by adapting to various data structures.

Convergence analysis

To demonstrate the convergence of NSGAMOF, the NMI and conductance indices are tracked over the iterations. As shown in fig. 5, the NSGAMOF algorithm converges slowly at the beginning, and the convergence starts to accelerate after 55 iterations. Finally, the algorithm converges stably within 90 iterations. Over the whole run, the NMI curve generally rises and becomes stable after some fluctuation, which indicates that the NSGAMOF algorithm converges well.

Fig. 5: The results of conductance and NMI running on (a) CoRA and (b) MPD. The red line denotes the conductance. The smaller the index value is, the better the performance is; the blue line denotes NMI, and the greater the value is, the better the performance is.

Scalability analysis

Figure 6 shows the scalability results as the number of network nodes and the dimension increase. The experiments employ 10 diverse datasets with 2 dimensions. In fig. 6(a), the running-time curve grows roughly quadratically as the networks increase in size. Overall, the execution time of the NSGAMOF-prik algorithm is reasonable. Figure 6(b) depicts the worst running time of the proposed algorithm as the network dimension increases. According to this result, the algorithm scales to high-dimensional networks.

Fig. 6: (a) With the growth of node numbers, the worst running time changes. It can reflect that the scalability of the algorithm varies with the network size. (b) With the growth of dimensions, the worst running time changes. The scalability of the algorithm varies with the dimension of networks.

Conclusion

In order to effectively balance the community structure of every layer and obtain a high-quality compound community, we transform multilayer network community detection into a MOOP and present a new GA-based multi-objective optimization algorithm, NSGAMOF, for multilayer network community detection. Modularity and local search are introduced to optimize every layer of a network iteratively. The diverse optimal selection policy is applied to ensure a relatively optimal compound community structure.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2019YFB2102300), National Natural Science Foundation of China (Nos. 61976181, 11931015, U1803263), Fok Ying-Tong Education Foundation, China (No. 171105), Key Technology Research and Development Program of Science and Technology-Scientific and Technological Innovation Team of Shaanxi Province (No. 2020TD-013), and by the Slovenian Research Agency (Nos. P1-0403 and J1-2457).

Data availability statement: All data that support the findings of this study are included within the article (and any supplementary files).
