Community detection via an efficient nonconvex optimization approach based on modularity

doi:10.1016/j.csda.2020.107163

Computational Statistics & Data Analysis

Volume 157, May 2021, 107163

https://doi.org/10.1016/j.csda.2020.107163

Abstract

Maximizing modularity is a widely used approach to community detection, but because of its high computational complexity it is generally solved by approximate or greedy search. In this paper, we propose a method, named MSM, that reformulates modularity maximization as a subset identification problem and maximizes a surrogate of the modularity. The surrogate is constructed by replacing the discontinuous indicator functions in the reformulated modularity function with the continuous truncated $L_{1}$ function. This approximates the NP-hard problem of maximizing the modularity function by a non-convex optimization problem, which can be solved efficiently via DC (difference of convex functions) programming. The proposed MSM method can be used for community detection when the number of communities is given, and it can also be applied when the number of communities is unknown. We demonstrate the advantages of the proposed MSM method through simulation results and real data analyses.

Introduction

In recent years, network analysis has become a very popular research direction in many fields. The literature covers many kinds of networks, including technological networks (Watts and Strogatz, 1998, Gastner and Newman, 2004), social networks (Travers and Milgram, 1969, Guimera et al., 2003), information networks (Flake et al., 2002), biological networks (Négyessy et al., 2006, Barabasi et al., 2011), etc. In many cases, the network units can be divided into groups such that there are many edges between units in the same group, but relatively few edges between units in different groups. Such groups are viewed as communities, which are often associated with important structural characteristics of a complex system.

To recover network communities, a large number of algorithms have been proposed in the literature. These include greedy algorithms, such as the GN algorithm (Newman and Girvan, 2004) and the Kernighan–Lin algorithm (Kernighan and Lin, 1970); algorithms that optimize some reasonable criterion over all possible partitions of a network, such as the spectral clustering methods (McSherry, 2001, Lei and Rinaldo, 2015) and the modularity optimization methods (Clauset et al., 2004, Newman, 2006a, Chen et al., 2018); and algorithms based on probability models, such as the stochastic block models (SBMs) (Holland et al., 1983, Zhang et al., 2017), the degree-corrected stochastic block models (DCSBMs) (Karrer and Newman, 2011, Chen et al., 2018) and the latent space models (Hoff et al., 2002). In addition, some methods are designed for community detection with overlap (Ball et al., 2011, Amini and Levina, 2018, Jin et al., 2019, Mao et al., 2020), that is, for networks whose nodes may belong to more than one community.

In this paper, we mainly focus on modularity optimization algorithms for network community detection, which are widely used due to their practicality and efficiency (Reichardt and Bornholdt, 2007, Chen et al., 2014). Modularity is considered to be one of the most important community detection criteria, and it has the unique privilege of being at the same time a global criterion to define a community, a quality function and the key ingredient of the most popular method of graph clustering (Fortunato, 2010). Under the modularity framework, the key problem is maximizing the modularity function, which is an NP-hard problem (Newman, 2006b). The earliest algorithm for maximizing the modularity function is the GN algorithm (Newman and Girvan, 2004), which is a greedy algorithm. To avoid some unnecessary operations of the GN algorithm on sparse networks, Clauset et al. (2004) proposed the CNM algorithm. These two algorithms are based on hierarchical search, while some follow-up algorithms are based on spectral optimization. For example, Newman (2006a) rewrote the expression of modularity in terms of the eigenspectrum of the modularity matrix, and then proposed the EIGN algorithm based on this expression. In addition, there are some methods that optimize modularity based on block models (Chen et al., 2018).
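To make the spectral idea concrete, the following is a minimal sketch of a single leading-eigenvector bisection on an undirected network, in the spirit of Newman (2006a). It is an illustration under stated assumptions, not the EIGN implementation compared in this paper: the function name `leading_eigenvector_split` and the toy network are hypothetical, and the full method would bisect repeatedly and refine the result.

```python
import numpy as np

def leading_eigenvector_split(A):
    """Split an undirected network in two using the sign pattern of the
    leading eigenvector of the modularity matrix B = A - k k^T / (2m).
    Hypothetical helper for illustration only."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                   # node degrees
    two_m = k.sum()                     # twice the number of edges
    B = A - np.outer(k, k) / two_m      # modularity matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    return (v > 0).astype(int)          # community label 0 or 1 per node

# Toy network (assumed): two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

labels = leading_eigenvector_split(A)
print(labels)
```

Which triangle receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping of nodes is meaningful.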

These algorithms optimize modularity approximately, seeking a proper balance between community detection accuracy and computational efficiency. Besides, there are many useful strategies for approximate optimization, some of which relax the binary membership assignment to a continuous version to ease the optimization (Amini and Levina, 2018, Liu et al., 2017). In particular, Liu et al. (2017) reconstructed the objective function of a subset selection problem with some indicator functions, and then approximated the indicator functions with the truncated $L_{1}$ function proposed by Shen et al. (2012). Drawing on the ideas of Shen et al. (2012) and Liu et al. (2017), we reformulate the community detection problem as a subset identification problem, which is solved by maximizing a surrogate of modularity. The proposed method is therefore named Maximizing the Surrogate of Modularity, or MSM for short. Specifically, the surrogate of modularity is constructed by replacing the discontinuous indicator functions in the reformulated modularity function with the continuous truncated $L_{1}$ function and by adding some regularization terms, as in Liu et al. (2017). As a result, the NP-hard problem of maximizing the modularity function is approximated by a non-convex optimization problem, which can be solved efficiently via DC programming (Le Thi and Tao, 2005). We then demonstrate the advantages of the proposed MSM method through simulation results and real data analyses.
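For reference, the truncated $L_{1}$ function of Shen et al. (2012) is commonly written as follows. The exact form and tuning used in this paper are not shown in this excerpt, so the symbols $J_{\tau}$ and $\tau$ are assumptions made for illustration.

$$J_{\tau}(z) = \min\left(\frac{|z|}{\tau}, 1\right), \qquad \tau > 0.$$

Since $J_{\tau}(0) = 0$ and $J_{\tau}(z) = 1$ whenever $|z| \geq \tau$, the function $J_{\tau}$ approximates the indicator $I(z \neq 0)$ as $\tau \to 0^{+}$ while remaining continuous. It can also be written as a difference of two convex functions,

$$J_{\tau}(z) = \frac{|z|}{\tau} - \max\left(\frac{|z|}{\tau} - 1, 0\right),$$

which is the structure that DC programming exploits: at each iteration one of the two convex parts is typically replaced by its linearization at the current iterate, so that the remaining subproblem is convex.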

The rest of this paper is organized as follows. We elaborate on the definition of modularity and the proposed algorithm in Section 2. We then present simulation results for the proposed algorithm and some related algorithms in Section 3, followed by some real data analyses in Section 4. Finally, we conclude this paper in Section 5.

Section snippets

Modularity

First, we introduce some notation. Let $G = (V, E)$ denote a network with the node set $V = \{1, \dots, n\}$ and the edge set $E \subseteq V \times V$, which can be formulated by the adjacency matrix $A \equiv [A_{i j}] \in {[0, + \infty)}^{n \times n}$, where $A_{i j} > 0$ if $(i, j) \in E$, otherwise $A_{i j} = 0$. Suppose there is no self-loop in network $G$, i.e. $A_{i i} = 0$ for each node $i \in V$. Let $\mu \equiv \sum_{i = 1}^{n} \sum_{j = 1}^{n} A_{i j}$. For each $i \in V$, let $d_{i}^{out} \equiv \sum_{j = 1}^{n} A_{i j}$ denote the out-degree and $d_{i}^{in} \equiv \sum_{j = 1}^{n} A_{j i}$ denote the in-degree. If $G$ is undirected, then $d_{i}^{in} = d_{i}^{out}$, as $A_{i j} = A_{j i}$ for each $i, j \in V$.
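The notation above can be sketched in a few lines of NumPy. The function `modularity` below uses the standard Newman-type modularity, $Q = \frac{1}{\mu} \sum_{i, j} \left(A_{i j} - d_{i}^{out} d_{j}^{in} / \mu\right) I(c_{i} = c_{j})$, which is an assumption here because the paper's own definition is cut off in this excerpt. The function names and the toy network are hypothetical.

```python
import numpy as np

def degrees_and_total(A):
    """Return (d_out, d_in, mu) for a nonnegative adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    d_out = A.sum(axis=1)   # d_i^out = sum_j A_ij (row sums)
    d_in = A.sum(axis=0)    # d_i^in  = sum_j A_ji (column sums)
    mu = A.sum()            # mu = sum_i sum_j A_ij
    return d_out, d_in, mu

def modularity(A, labels):
    """Standard Newman-type modularity of a partition (assumed form,
    not necessarily the exact definition used in the paper)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    d_out, d_in, mu = degrees_and_total(A)
    same = labels[:, None] == labels[None, :]   # I(c_i = c_j)
    B = A - np.outer(d_out, d_in) / mu          # observed minus expected weight
    return float((B * same).sum() / mu)

# Toy undirected network (assumed): two triangles joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

d_out, d_in, mu = degrees_and_total(A)
q_two = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(A, [0, 0, 0, 0, 0, 0])   # everything in one community
print(mu, q_two, q_one)
```

For this undirected example the in-degrees and out-degrees coincide and $\mu$ equals twice the number of edges; placing each triangle in its own community gives a higher modularity than the trivial single-community partition.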

Suppose that $G$ has $K$

Simulation study

In this section, we present some simulation results to demonstrate the performance of the proposed MSM method, compared with three classical community detection algorithms based on modularity, GN, CNM and EIGN, proposed in Newman and Girvan (2004), Clauset et al. (2004) and Newman (2006a) respectively, two relaxation algorithms based on block models, i.e. SDP_1 proposed in Amini and Levina (2018) and CMM proposed in Chen et al. (2018), and two overlapping community detection methods proposed by

Real data analyses

In this section, we investigate the performance of the proposed method and its competitors on seven commonly used real-world networks. A brief introduction to these networks is as follows. The network named Zachary’s Karate Club (Zachary, 1977) consists of 34 members and 78 edges, where the members were divided into two groups after a quarrel. The Visuotactile brain areas and connections (Négyessy et al., 2006) is a network describing the connections in the visual activity areas of the

Conclusion

In this paper, the MSM method has been established to find a proper balance between community detection accuracy and computational efficiency, which is achieved by maximizing a surrogate of modularity. In this way, the NP-hard combinatorial problem of maximizing modularity is approximately transformed into a nonconvex optimization problem, which can be solved by DC programming. The convergence of the MSM method is established, and its good performance is presented by some simulation

Acknowledgements

This work was supported by NSFC grants 11571068, 11631003 and 11690012, the Special Fund for Key Laboratories of Jilin Province, China grant 20190201285JC, and the project of teaching reform of higher education of Jilin Province, China grant JLL0824320190726182454.

References (35)

Fortunato, S. (2010). Community detection in graphs. Phys. Rep.

Holland, P.W., et al. (1983). Stochastic blockmodels: First steps. Social Networks.

Amini, A.A., et al. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist.

Amini, A.A., et al. (2018). On semidefinite relaxations for the block model. Ann. Statist.

Ball, B., et al. (2011). Efficient and principled method for detecting communities in networks. Phys. Rev. E.

Barabasi, A., et al. (2011). Network medicine: a network-based approach to human disease. Nature Rev. Genet.

Chen, M., et al. (2014). Community detection via maximization of modularity and its variants. Comput. Soc. Syst. IEEE Trans.

Chen, Y., et al. (2018). Convexified modularity maximization for degree-corrected stochastic block models. Ann. Statist.

Clauset, A., et al. (2004). Finding community structure in very large networks. Phys. Rev. E.

Flake, G., et al. (2002). Self-organization and identification of Web communities. Computer.

Gastner, M.T., et al. (2004). Diffusion-based method for producing density-equalizing maps. Proc. Natl. Acad. Sci.

Gleiser, P.M., et al. (2003). Community structure in jazz. Adv. Complex Syst.

Guimera, R., et al. (2003). Self-similar community structure in a network of human interactions. Phys. Rev. E.

Hoff, P.D., et al. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc.

Jeong, H., et al. (2000). The large-scale organization of metabolic networks. Nature.

Jin, J., et al. (2019). Estimating network memberships by simplex vertex hunting.

Karrer, B., et al. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E.
