1 Introduction

Nowadays, networks are ubiquitous and often cluttered, which makes it difficult to recognize patterns in them and to mine knowledge from them. The first step toward understanding network structure is to arrange the network in an organized manner: identifying nodes with similar attributes or functions and combining them into a group or cluster. In fact, most real-world networks are empirically observed to possess community structure [24], where nodes with analogous properties compose functional modules. For instance, in online social networks, users form groups according to common experiences, affiliations, or hobbies; in biological networks, cells with similar functions constitute tissues; in research networks, researchers with similar interests comprise research fields or disciplines. Revealing the hidden community structure can significantly simplify the representation of a network and facilitate its comprehension.

Given the importance of community structure, various community detection approaches have been proposed in the literature to identify meaningful communities in networks [2]. Existing community detection methods can be categorized into two classes: graph-theoretic approaches and probabilistic generative models. In traditional graph-theoretic approaches, general clustering methods such as hierarchical clustering [5], k-means clustering [6], and spectral clustering [7] are applied to detect communities in networks. In the recent decade, various novel graph-theoretic methods have been proposed. Divisive methods (e.g., Newman and Girvan [8]) iteratively delete the identified inter-community links to separate the entire network into isolated communities. Newman [9] introduced the concept of modularity to assess the quality of the detected communities, and the optimization of modularity led to a series of detection algorithms. The clique percolation method of Palla et al. [10] detects overlapping cohesive clusters, or cliques, in networks. All of these methods rely on graph-theoretic measures to identify cohesive groups of nodes. Additionally, the performance limits of community detection in various random graph models are investigated in [11–13].

In probabilistic generative models, the observed network is regarded as the outcome of a probabilistic generative process driven by the community structure, and the community detection problem is posed as a statistical inference problem. Within this category, Airoldi et al. presented the mixed membership stochastic blockmodel in [14], where each node’s community affiliation strengths across all the communities are regarded as a probability distribution. A variational EM algorithm was proposed to infer the community affiliations efficiently. Yang and Leskovec proposed the affiliated graph model (AGM) in [15]. In AGM, the total community affiliation strength is allowed to vary from node to node, leading to more degrees of freedom in modeling overlapping communities. Sun et al. investigated community detection in heterogeneous information networks [16], where different types of nodes, e.g., authors, papers, and venues for citation networks, are present.

In real-life networks such as the Facebook friendship network and the DBLP collaboration network, nodes form links with each other through intelligent interactions. For example, in a social network, users interact with each other on common hobbies, experiences, and affiliations, and finally decide whether or not to make a connection (become friends), i.e., to form a link. Our hypothesis is that users are rational in forming their social networks; in other words, when deciding whether to form a link, a user judges whether the benefit of the link is worth its cost (the effort and time spent on the relation). So far, such strategic interactions among nodes have not been considered in community detection. On the other hand, game theory, originating from microeconomics, is a mathematical tool that has been applied to various engineering problems to model the strategic interactions among rational players [17–22]. The outcome of the mutual interactions between rational players can be predicted by using game theory. This motivates us to resort to game theory to investigate the interactions among nodes in a network with community structure.

Most real-world observations of networks are noisy and incomplete, i.e., there are missing links and fake links in the observed graph, due to technological constraints or privacy regulations. For instance, observations of social networks are often incomplete because of the privacy policies of social websites and the flaws of the acquisition approaches, not to mention that social network data cannot track all the interactions among users. So far, no existing work has studied the behavior of community detection algorithms under a generative model of noise in observed networks. This motivates us to consider the community detection problem in both noiseless and noisy networks. We find that link errors, or noise, can be absorbed naturally into the proposed game-theoretic framework. The main contributions of this paper are summarized as follows.

  • We propose a game-theoretic framework to model the interactions among strategic nodes in a network with community structure. The network can be either noiseless or noisy. The proposed link formation game connects the observed network structure with the hidden community structure.

  • The Nash equilibrium (NE) of the noiseless network game and the subgame perfect equilibrium (SPE) of the noisy network game are derived. With these equilibria, a game-theoretic generative model of networks is obtained, which enables community detection in both noiseless networks and noisy networks.

  • According to the proposed game-theoretic generative model, we derive a general community detection algorithm based on expectation maximization (EM) for both noiseless networks and noisy networks. The effectiveness of the proposed detection algorithm is validated through simulations on synthetic networks and experiments on real-world networks.

The roadmap of the rest of this paper is as follows. In Section 2, we elaborate the game-theoretic model and present the equilibrium analyses. In Section 3, an EM-based general community detection algorithm is presented according to the proposed game-theoretic model. In Section 4, simulations as well as real-world dataset experiments are conducted. In Section 5, we conclude this work.

2 Game-theoretic generative model of the networks

Game theory is a mathematical tool used to study the strategic interactions among multiple rational decision makers [23]. A game consists of (i) a set of players, (ii) a set of actions for each player, and (iii) a utility for each player given the actions of all the players. Outcomes of a game can be obtained by resorting to solution concepts such as the Nash equilibrium and the subgame perfect equilibrium, which will be discussed later. In a network, each node (e.g., a user in a social network) can be modeled as a rational player. The nodes interact with each other to form links, generating the graph structure that we observe. These interactions are illustrated in Fig. 1. The utilities of the interactions depend on the community affiliations of the nodes. The fundamental hypothesis of this work is that, if two users in a social network both belong to a certain group, then they prefer to be friends, because forming a link results in higher utilities for both parties. The interactions among nodes in a network are thus analogous to the interactions among players in a game, which makes game theory a natural tool to model and understand the community detection problem. Moreover, due to acquisition errors, networks acquired from real-world data may be noisy and incomplete, i.e., many true links can be missed and many spurious links may be recorded. In this paper, we show that the proposed game-theoretic framework can handle this issue, and we therefore consider community detection over both noiseless and noisy networks.

Fig. 1 Graphical illustration of the interactions of nodes to form links. Each circle corresponds to a community in the network. Each pair of nodes interacts to decide whether to form a link with each other or not. The utility function of the interactions depends on the community affiliations of the nodes.

In the following, we present the proposed game-theoretic generative models for both noiseless and noisy networks. Specifically, given the nodes’ utility functions, which depend on their community affiliations, we derive the equilibrium pairwise link probability, based on which we propose generative models of the networks.

Consider a network with N nodes and K communities. For each node u∈{1,2,…,N}, we denote the nonnegative vector \(\mathbf {x}_{u}\in \mathbb {R}^{K}\) as its community affiliation strength vector, whose kth component represents the strength of node u’s affiliation to community k. The larger an entry of \(\mathbf {x}_{u}\), the stronger the affiliation of node u to the corresponding community.

2.1 Game for noiseless networks

Each pair of nodes interacts with each other to decide whether to form a link or not. Specifically, when two nodes u,v interact, they play the following game:

  • Pure strategies: {Link, Not Link}.

  • Mixed strategies: [0,1], the probability of Link.

  • Utility functions:

    1. If both nodes choose Link, then each one gets utility 1.

    2. If both nodes choose Not Link, then each one gets utility 0.

    3. From node u’s perspective, (i) if it chooses Not Link but its opponent v chooses Link, then it may get some one-shot information sharing or benefits from v and thus gets utility \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})\); (ii) if it chooses Link but its opponent v chooses Not Link, then it may have spent some effort on trying to make this connection and thus gets (possibly negative) utility \(f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})\). We assume that \(f_{1}\) and \(f_{2}\) are symmetric functions, i.e., \(f_{i}(\mathbf{x}_{u},\mathbf{x}_{v})=f_{i}(\mathbf{x}_{v},\mathbf{x}_{u})\), i∈{1,2}, so that the utility structure of the pair {u,v} is symmetric. The utility functions are summarized in Table 1.

      Table 1 The utility table of the game for noiseless networks

We note that the proposed game contains two general functions \(f_{1}\) and \(f_{2}\). Different choices for these two functions lead to different games, and hence to different game-theoretic generative models of the networks. For general \(f_{1},f_{2}\), the Nash equilibrium (NE) of the proposed game is identified in the following proposition. We consider two regions for the utility functions \(f_{1}\) and \(f_{2}\): \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})<1,f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})<0\) and \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})>1,f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})>0\). We explain these two regions from node u’s perspective as follows. In the first region, if u selects Not Link while the opponent node v selects Link, then u gets some one-shot benefits from v, such as shared information, but loses the long-term benefits of the potential connection, so that its utility \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})\) is smaller than 1. If u selects Link while node v selects Not Link, then u may have spent some effort on trying to establish the connection and therefore loses some utility, i.e., its utility \(f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})\) is less than 0. In the second region, if u selects Not Link while node v selects Link, then u gets a one-shot benefit \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})\) larger than 1, since it does not need to spend any effort on establishing the connection. If u selects Link while node v selects Not Link, then, although the connection is not established, u makes node v more aware of it, i.e., it advertises itself. Hence, u still gets some positive utility.

Proposition 1

In the proposed game for noiseless networks, suppose f1(xu,xv)<1,f2(xu,xv)<0 or f1(xu,xv)>1,f2(xu,xv)>0, then choosing the strategy Link with probability:

$$ p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})=\frac{f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1} $$
(1)

is a symmetric mixed-strategy NE.

Proof

It is easy to verify that \(p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\in(0,1)\) in both regions, i.e., \(p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\) is a proper probability. For brevity, write \(p^{\star}\), \(f_{1}\), and \(f_{2}\) for the values of these functions at \((\mathbf{x}_{u},\mathbf{x}_{v})\). Suppose node v selects the strategy Link with probability \(p^{\star}\). Then, if node u selects Link, its expected utility is \(p^{\star}+(1-p^{\star})f_{2}=f_{1}f_{2}/(f_{1}+f_{2}-1)\). If node u selects Not Link, its expected utility is \(p^{\star}f_{1}=f_{1}f_{2}/(f_{1}+f_{2}-1)\). Consequently, node u is indifferent between the two strategies. Due to the symmetric structure of the game, node v is also indifferent as long as node u plays the mixed strategy \(p^{\star}\). Hence, \(p^{\star}\) is a symmetric mixed-strategy NE. □
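The indifference argument in the proof can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the two sample utility values (one per region of Proposition 1) are illustrative assumptions.

```python
from typing import Tuple


def mixed_ne_link_probability(f1: float, f2: float) -> float:
    """Equation (1): p* = f2 / (f1 + f2 - 1), valid in the two regions of Proposition 1."""
    in_region_1 = f1 < 1 and f2 < 0
    in_region_2 = f1 > 1 and f2 > 0
    if not (in_region_1 or in_region_2):
        raise ValueError("(f1, f2) lies outside the two regions of Proposition 1")
    return f2 / (f1 + f2 - 1.0)


def expected_utilities(p_opponent: float, f1: float, f2: float) -> Tuple[float, float]:
    """Expected utility of Link and of Not Link when the opponent plays Link w.p. p_opponent."""
    u_link = p_opponent * 1.0 + (1.0 - p_opponent) * f2
    u_not_link = p_opponent * f1 + (1.0 - p_opponent) * 0.0
    return u_link, u_not_link


# Hypothetical utility values, one point from each region.
for f1, f2 in [(0.6, -0.6), (1.5, 0.5)]:
    p_star = mixed_ne_link_probability(f1, f2)
    u_link, u_not_link = expected_utilities(p_star, f1, f2)
    print(f"f1={f1}, f2={f2}: p*={p_star:.4f}, "
          f"U(Link)={u_link:.4f}, U(Not Link)={u_not_link:.4f}")
```

At the equilibrium probability the two expected utilities coincide, which is exactly the indifference condition used in the proof.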

Remarks 1

We note that, besides the mixed-strategy NE in Proposition 1, there also exist pure-strategy NEs. For instance, (Link, Link) and (Not Link, Not Link) are NEs when \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})<1,f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})<0\), while (Link, Not Link) and (Not Link, Link) are NEs when \(f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})>1,f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})>0\). However, since our aim is to obtain a non-degenerate link probability for the generative model, we focus only on the mixed-strategy NE.

We assume that two nodes will link with each other if and only if both of them choose the strategy Link. Hence, at the NE, the link probability of the node pair (u,v) is:

$$ H(\mathbf{x}_{u},\mathbf{x}_{v})\triangleq p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})^{2}=\left(\frac{f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1}\right)^{2}. $$
(2)

Different utility functions f1() and f2() lead to different link probability functions H(). Two examples of utility functions that satisfy the assumption of Proposition 1 are listed as follows.

  • When \(f_{1}(\mathbf {x}_{u},\mathbf {x}_{v})=\sqrt {1-\exp (-\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v})}\) and f2(xu,xv)=−f1(xu,xv), the link probability function is \(H(\mathbf {x}_{u},\mathbf {x}_{v})=1-\exp (-\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v})\), which coincides with the affiliated graph model (AGM) proposed in [15, 24]. The AGM is therefore a special case of our game-theoretic model, obtained by choosing the link probability function H in this form.

  • When \(f_{1}(\mathbf {x}_{u},\mathbf {x}_{v})=\sqrt {\frac {\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}}{1+\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}}}\) and f2(xu,xv)=−f1(xu,xv), the link probability function is \(H(\mathbf {x}_{u},\mathbf {x}_{v})=\frac {\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}}{1+\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}}\).

The above two link probability functions are intuitively reasonable: if nodes u,v share many community affiliations, the inner product \(\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}\) is large, and so is the link probability H(xu,xv). The two functions differ in how fast they increase with \(\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v}\), and different networks may be better described by different link probability functions. After every pair of nodes finishes the game and decides whether to form a link or not, the entire network is constructed. Hence, the proposed game-theoretic model is a generative model of the networks.
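As a quick sanity check of the two examples, the sketch below plugs each choice of f1 (with f2 = −f1) into equation (2) and compares the result with the stated closed form of H. The helper names and the affiliation vectors are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def dot(x_u: Sequence[float], x_v: Sequence[float]) -> float:
    """Inner product x_u^T x_v of two affiliation strength vectors."""
    return sum(a * b for a, b in zip(x_u, x_v))


def link_probability_from_utilities(f1: float, f2: float) -> float:
    """Equation (2): H = (f2 / (f1 + f2 - 1))**2."""
    return (f2 / (f1 + f2 - 1.0)) ** 2


def f1_exponential(x_u: Sequence[float], x_v: Sequence[float]) -> float:
    """First example: f1 = sqrt(1 - exp(-x_u^T x_v))."""
    return math.sqrt(1.0 - math.exp(-dot(x_u, x_v)))


def f1_rational(x_u: Sequence[float], x_v: Sequence[float]) -> float:
    """Second example: f1 = sqrt(x_u^T x_v / (1 + x_u^T x_v))."""
    s = dot(x_u, x_v)
    return math.sqrt(s / (1.0 + s))


# Hypothetical affiliation vectors with K = 2 communities.
x_u = [1.0, 0.2]
x_v = [0.8, 0.0]
s = dot(x_u, x_v)

f1 = f1_exponential(x_u, x_v)
h_exp = link_probability_from_utilities(f1, -f1)
f1 = f1_rational(x_u, x_v)
h_rat = link_probability_from_utilities(f1, -f1)

print("exponential form:", h_exp, "closed form:", 1.0 - math.exp(-s))
print("rational form   :", h_rat, "closed form:", s / (1.0 + s))
```

With f2 = −f1 the denominator of (1) reduces to −1, so the equilibrium probability is simply f1 and H is its square.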

2.2 Game for noisy networks

The game-theoretic generative process of the noisy networks consists of two stages since, in addition to the generative process for the noiseless networks, we need another stage to take the generation of noise into consideration. The first stage is to determine whether to form a link or not while the second stage is to decide whether to report the truth about the link state. The overall utility is the sum of the utilities obtained in the two stage games. The first stage is the same as the game for the noiseless networks. Thus, we just focus on the second stage, which is specified for a node pair (u,v) as follows.

  • Pure strategies: Truth-telling and Not Truth-telling

  • Mixed strategies: [0,1], the probability of Truth-telling

  • Outcome: The true linking state is reported if and only if both nodes adopt strategy Truth-telling.

  • Utility functions: If u,v are linked in the first stage, the utility functions of all possible circumstances are listed in Table 2 (a). Similarly, if u,v are not linked in the first stage, the utility functions are listed in Table 2 (b). The utility functions gi() are all symmetric functions, i.e., gi(xu,xv)=gi(xv,xu),i∈{1,2,3,4}.

    Table 2 Utility table of the second stage in the game for noisy networks

We denote the overall strategy of the formulated two-stage dynamic game as <p,(q1,q2)>, where p is the probability of the strategy Link in the first stage, and q1 and q2 are the probabilities of the strategy Truth-telling in the second stage given that a link between u,v is formed or not formed in the first stage, respectively.

Proposition 2

In the proposed dynamic game for noisy networks, \(\left < p^{\star },(q_{1}^{\star },q_{2}^{\star })\right >\) given in (3), (4), and (5) is a symmetric mixed-strategy subgame perfect equilibrium (SPE)

$$ q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})=\frac{g_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1} $$
(3)
$$ q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})=\frac{g_{4}(\mathbf{x}_{u},\mathbf{x}_{v})}{g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{4}(\mathbf{x}_{u},\mathbf{x}_{v})-1} $$
(4)
$$ p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})=\frac{f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1-g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})} $$
(5)

provided that \(0\leq p^{\star }(\mathbf {x}_{u},\mathbf {x}_{v}),q_{1}^{\star }(\mathbf {x}_{u},\mathbf {x}_{v}),q_{2}^{\star }(\mathbf {x}_{u},\mathbf {x}_{v})\leq 1\).

Proof

According to Proposition 1, the mixed strategy in the second stage, i.e., \(q_{1}^{\star }\) and \(q_{2}^{\star }\), is an NE of the second stage. By the same calculation as in the proof of Proposition 1, the expected second-stage utility of each node at this NE is \(g_{1}q_{1}^{\star}\) if a link was formed in the first stage and \(g_{3}q_{2}^{\star}\) otherwise. To show that \(\left < p^{\star },(q_{1}^{\star },q_{2}^{\star })\right >\) is also an NE at the first stage, we assume that node v uses the strategy \(\left < p^{\star },(q_{1}^{\star },q_{2}^{\star })\right >\). Thus, if node u chooses Link in the first stage, regardless of its strategy in the second stage, its total utility is given in (6).

$$ {\begin{aligned} &p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})+(1-p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v}))f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})+p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\\ &+(1-p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v}))g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\\ &=g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\\ &+\frac{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1-g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})}. \end{aligned}} $$
(6)

If node u chooses Not Link in the first stage, regardless of its strategy in the second stage, its total utility is given in (7).

$$ {\begin{aligned} &p^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\\ &=g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})\\ &+\frac{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})}{f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})+f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})-1-g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})+g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})}. \end{aligned}} $$
(7)

Thus, at the first stage, node u is indifferent between its two pure strategies. We see that \(\left < p^{\star },(q_{1}^{\star },q_{2}^{\star })\right >\) is also an NE at the first stage and hence an SPE of the entire dynamic game. □

Denote by \(Y(u,v)\) and \(\hat {Y}(u,v)\) the binary variables representing the true link state and the observed noisy link state between nodes u,v, respectively, where “1” represents the presence of a link and “0” represents no link. Then, at the SPE \(\left < p^{\star },(q_{1}^{\star },q_{2}^{\star })\right >\), the link probability of nodes u,v is \(H(\mathbf {x}_{u},\mathbf {x}_{v})=p^{\star }(\mathbf {x}_{u},\mathbf {x}_{v})^{2}\), while the fake link and missing link probabilities are:

$$ \begin{aligned} \epsilon_{1}(\mathbf{x}_{u},\mathbf{x}_{v})&\triangleq\mathbb{P}\left(\hat{Y}(u,v)=1\,|\,Y(u,v)=0\right)=1-q_{2}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})^{2},\\ \epsilon_{2}(\mathbf{x}_{u},\mathbf{x}_{v})&\triangleq\mathbb{P}\left(\hat{Y}(u,v)=0\,|\,Y(u,v)=1\right)=1-q_{1}^{\star}(\mathbf{x}_{u},\mathbf{x}_{v})^{2}. \end{aligned} $$

Thus, different utility functions lead to different link probabilities and link error probabilities. Specifically, for any link probability function H(), any fake link probability ε1 and any missing link probability ε2, we can achieve them by setting the utility functions in the game model as follows:

$$\begin{array}{@{}rcl@{}} && f_{1}(\mathbf{x}_{u},\mathbf{x}_{v})=(1+\epsilon_{1}-\epsilon_{2})\sqrt{H(\mathbf{x}_{u},\mathbf{x}_{v})}, \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} && g_{1}(\mathbf{x}_{u},\mathbf{x}_{v})=\sqrt{1-\epsilon_{2}},\ g_{3}(\mathbf{x}_{u},\mathbf{x}_{v})=\sqrt{1-\epsilon_{1}}, \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} && f_{2}(\mathbf{x}_{u},\mathbf{x}_{v})=-f_{1}(\mathbf{x}_{u},\mathbf{x}_{v}), \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} &&g_{2}(\mathbf{x}_{u},\mathbf{x}_{v})=-g_{1}(\mathbf{x}_{u},\mathbf{x}_{v}), \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &&g_{4}(\mathbf{x}_{u},\mathbf{x}_{v})=-g_{3}(\mathbf{x}_{u},\mathbf{x}_{v}). \end{array} $$
(12)

Thus, by properly tuning the utility functions as above, the game-theoretic framework can model a general class of generative processes of networks with community structure. In the game-theoretic model, each node pair links with probability H(xu,xv); each true link state Y(u,v) is then flipped with probability ε1 if Y(u,v)=0 (fake link) or with probability ε2 if Y(u,v)=1 (missing link), producing the observed link state \(\hat {Y}(u,v)\).
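The claim that (8)–(12) reproduce any target triple can be verified directly from (3)–(5). The sketch below is illustrative only: the function names and the target values H = 0.3, ε1 = 0.005, ε2 = 0.2 are assumptions made for the example.

```python
import math
from typing import Tuple


def mixed_ne(a: float, b: float) -> float:
    """Mixed-strategy probability b / (a + b - 1), the common form of (3) and (4)."""
    return b / (a + b - 1.0)


def subgame_perfect_equilibrium(
    f1: float, f2: float, g1: float, g2: float, g3: float, g4: float
) -> Tuple[float, float, float]:
    """Return (p*, q1*, q2*) according to equations (3), (4), and (5)."""
    q1 = mixed_ne(g1, g2)                          # equation (3)
    q2 = mixed_ne(g3, g4)                          # equation (4)
    p = f2 / (f1 + f2 - 1.0 - g1 * q1 + g3 * q2)   # equation (5)
    return p, q1, q2


def utilities_from_targets(
    h: float, eps1: float, eps2: float
) -> Tuple[float, float, float, float, float, float]:
    """Utility values (f1, f2, g1, g2, g3, g4) prescribed by equations (8)-(12)."""
    f1 = (1.0 + eps1 - eps2) * math.sqrt(h)
    g1 = math.sqrt(1.0 - eps2)
    g3 = math.sqrt(1.0 - eps1)
    return f1, -f1, g1, -g1, g3, -g3


# Hypothetical targets for a single node pair.
h_target, eps1, eps2 = 0.3, 0.005, 0.2
p, q1, q2 = subgame_perfect_equilibrium(*utilities_from_targets(h_target, eps1, eps2))

print("link probability p*^2        :", p ** 2)
print("fake link probability 1-q2*^2:", 1.0 - q2 ** 2)
print("missing link prob.  1-q1*^2  :", 1.0 - q1 ** 2)
```

The three printed quantities match the targets up to floating-point error, since with g2 = −g1 and g4 = −g3 the second-stage probabilities reduce to q1⋆ = g1 and q2⋆ = g3.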

3 A general community detection algorithm for noisy networks

In this section, a community detection algorithm for the game-theoretic generative model is derived. Since noiseless networks simply correspond to noisy networks with ε1=ε2=0, we focus on community detection in noisy networks for the rest of this section. The game-theoretic model of noisy networks can be represented by three elements: the link probability function H(xu,xv), the fake link probability ε1, and the missing link probability ε2, i.e., a triple <H(xu,xv),ε1,ε2>. We assume that the link error probabilities ε1 and ε2 are constants independent of the affiliation strengths xu. The reason for this assumption is that the link error probabilities are related to the accuracy of the data acquisition technology, which is independent of the community structure of the networks.

A graphical representation of the proposed game-theoretic generative model for noisy networks is shown in Fig. 2. For each pair of users u,v with community affiliation strength xu,xv, a link between them is formed with probability H(xu,xv). The link state Y(u,v) can be either “1” (linking) or “0” (not linking), with linking probability H(xu,xv), i.e.,

$$ Y(u,v)\sim\texttt{Bernoulli}(H(\mathbf{x}_{u},\mathbf{x}_{v})). $$
(13)
Fig. 2 Graphical illustration of the proposed game-theoretic generative model

Afterwards, noise is added in so that the link state Y(u,v) is flipped with fake link probability ε1 and missing link probability ε2 to generate the observed link state \(\hat {Y}(u,v)\), i.e.,

$$ \hat{Y}(u,v)\sim\texttt{Bernoulli}\left(\epsilon_{1}^{1-Y(u,v)}(1-\epsilon_{2})^{Y(u,v)}\right). $$
(14)
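The two sampling steps (13) and (14) translate into a short simulation. The following sketch assumes the link probability function H = 1 − exp(−x_u^T x_v) and stores link states in dictionaries keyed by node pairs (u, v) with u < v; both choices, as well as the function names and example values, are illustrative assumptions.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def link_probability(x_u: Sequence[float], x_v: Sequence[float]) -> float:
    """Assumed link probability function H = 1 - exp(-x_u^T x_v)."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(x_u, x_v)))


def sample_noisy_network(
    X: List[Sequence[float]], eps1: float, eps2: float, rng: random.Random
) -> Tuple[Dict[Pair, int], Dict[Pair, int]]:
    """Draw the true link states Y via (13) and the observed states Y_hat via (14)."""
    true_links: Dict[Pair, int] = {}
    observed_links: Dict[Pair, int] = {}
    for u in range(len(X)):
        for v in range(u + 1, len(X)):
            y = int(rng.random() < link_probability(X[u], X[v]))   # equation (13)
            p_observed = (1.0 - eps2) if y == 1 else eps1          # equation (14)
            true_links[(u, v)] = y
            observed_links[(u, v)] = int(rng.random() < p_observed)
    return true_links, observed_links


# Hypothetical example: six nodes, two non-overlapping communities of strength 1.5.
X_example = [[1.5, 0.0]] * 3 + [[0.0, 1.5]] * 3
Y, Y_hat = sample_noisy_network(X_example, eps1=0.005, eps2=0.2, rng=random.Random(0))
flipped = sum(Y[pair] != Y_hat[pair] for pair in Y)
print("true links:", sum(Y.values()), "observed links:", sum(Y_hat.values()),
      "flipped pairs:", flipped)
```

Setting eps1 = eps2 = 0 makes the observed network identical to the true one, which is the noiseless special case.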

We assume that the link error probabilities ε1,ε2 are known. Our goal is to infer the unknown community affiliation strengths \(\mathbf {X}\triangleq \{\mathbf {x}_{u}\}_{u=1}^{N}\), based on which we can perform community detection.

According to the generative model, the joint probability distribution function (PDF) of the true network \(\mathbf {Y}\triangleq \{Y(u,v)\}_{u,v=1,u< v}^{N}\) and the observed noisy network \(\hat {\mathbf {Y}}\triangleq \{\hat {Y}(u,v)\}_{u,v=1,u< v}^{N}\) is:

$$ p\left(\left.\mathbf{Y},\hat{\mathbf{Y}}\right|\mathbf{X}\right)\,=\,\prod_{u< v}\left[p\left(\left.\hat{Y}(u,v)\right|Y(u,v)\right)p\left(Y(u,v)|\mathbf{x}_{u},\mathbf{x}_{v}\right)\right], $$
(15)

while the marginal PDF of the observation \(\hat {\mathbf {Y}}\) is:

$$ \begin{aligned} &p\left(\left.\hat{\mathbf{Y}}\right|\mathbf{X}\right)\\ &\!\!\!=\prod_{u< v}\sum_{Y(u,v)\in\{0,1\}}p\left(\left.\hat{Y}(u,v)\right|Y(u,v)\right)p(Y(u,v)|\mathbf{x}_{u},\mathbf{x}_{v}). \end{aligned} $$
(16)

Hence, the maximum likelihood estimate (MLE) of the community affiliation strength parameter X can be calculated as:

$$ \mathbf{X}_{\text{ML}}=\arg\max_{\mathbf{X}}\log p\left(\left.\hat{\mathbf{Y}}\right|\mathbf{X}\right). $$
(17)

However, due to the existence of the latent variables Y (the true network), the maximization problem for the MLE is hard to solve: the summation (marginalization) in (16) sits inside the logarithm, so the logarithm cannot act directly on the joint distribution. We thus resort to the expectation maximization (EM) algorithm [25], which iterates between two steps, the expectation step (E-step) and the maximization step (M-step). We now derive an EM algorithm for the proposed generative model.

3.1 Derivation of the E-step

The joint PDF of the true link state Y(u,v) and the observed noisy link state \(\hat {Y}(u,v)\) is:

$$ \begin{aligned} &p\left(\left.Y(u,v),\hat{Y}(u,v)\right|\mathbf{x}_{u},\mathbf{x}_{v}\right)\\ &=p\left(Y(u,v)|\mathbf{x}_{u},\mathbf{x}_{v}\right)p\left(\left.\hat{Y}(u,v)\right|Y(u,v)\right)\\ &=\epsilon_{1}^{(1-Y(u,v))\hat{Y}(u,v)}(1-\epsilon_{1})^{(1-Y(u,v))(1-\hat{Y}(u,v))}\\ &~~~\epsilon_{2}^{Y(u,v)(1-\hat{Y}(u,v))}(1-\epsilon_{2})^{Y(u,v)\hat{Y}(u,v)}\\ &~~~H(\mathbf{x}_{u},\mathbf{x}_{v})^{Y(u,v)}(1-H(\mathbf{x}_{u},\mathbf{x}_{v}))^{1-Y(u,v)} \end{aligned} $$
(18)

Suppose we have an estimate \(\mathbf{X}^{\text{old}}\) of the community affiliation strength matrix, which we would like to update. Based on (18), the posterior distribution of the latent variable Y(u,v) is given by (19).

$$ {\begin{aligned} &q_{uv}\triangleq p\left(Y(u,v)=1\left|\hat{Y}(u,v),\mathbf{x}_{u}^{\text{old}},\mathbf{x}_{v}^{\text{old}}\right.\right)\\ &=\frac{p\left(Y(u,v)=1,\hat{Y}(u,v)\left|\mathbf{x}_{u}^{\text{old}},\mathbf{x}_{v}^{\text{old}}\right.\right)}{p\left(Y(u,v)=1,\hat{Y}(u,v)\Big|\mathbf{x}_{u}^{\text{old}},\mathbf{x}_{v}^{\text{old}}\right)+p\left(\left.Y(u,v)=0,\hat{Y}(u,v)\right|\mathbf{x}_{u}^{\text{old}},\mathbf{x}_{v}^{\text{old}}\right)} \end{aligned}} $$
(19)

Thus, we can derive the objective function in the M-step, i.e., the expected complete data log-likelihood, as follows:

$$\begin{array}{*{20}l} &Q(\mathbf{X},\mathbf{X}^{\text{old}})=\mathbb{E}_{\mathbf{Y}|\hat{\mathbf{Y}},\mathbf{X}^{\text{old}}}\left[\log p\left(\left.\mathbf{Y},\hat{\mathbf{Y}}\right|\mathbf{X}\right)\right]\\ &=\sum_{1\leq u< v\leq N}\mathbb{E}_{\mathbf{Y}|\hat{\mathbf{Y}},\mathbf{X}^{\text{old}}}\left[\log p\left(Y(u,v),\hat{Y}(u,v)\left|\mathbf{x}_{u},\mathbf{x}_{v}\right.\right)\right]\\ &=\sum_{1\leq u< v\leq N}\left\{q_{uv}\left[\left(1\,-\,\hat{Y}(u,v)\right)\log\epsilon_{2}\,+\,\hat{Y}(u,v)\log(1-\epsilon_{2})\right.\right.\\ &~~~\left.+\log H(\mathbf{x}_{u},\mathbf{x}_{v})\right]+(1-q_{uv})\left[\hat{Y}(u,v)\log\epsilon_{1}\right.\\ &~~~\left.\left.+\left(1-\hat{Y}(u,v)\right)\log(1-\epsilon_{1})+\log(1-H(\mathbf{x}_{u},\mathbf{x}_{v}))\right]\right\}. \end{array} $$
(20)
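For a single node pair, the posterior (19) follows from evaluating the joint distribution (18) at Y(u,v)=1 and Y(u,v)=0. The sketch below shows this computation; the function name and the numerical values are assumptions made for illustration, and the degenerate case where both joint terms vanish is not handled.

```python
def posterior_link_probability(y_hat: int, h: float, eps1: float, eps2: float) -> float:
    """E-step quantity q_uv = P(Y = 1 | Y_hat, x_u_old, x_v_old) from (18) and (19).

    h is the link probability H evaluated at the current estimates of x_u and x_v.
    """
    # Joint probability of (Y = 1, Y_hat): linked, then reported correctly or missed.
    joint_linked = h * ((1.0 - eps2) if y_hat == 1 else eps2)
    # Joint probability of (Y = 0, Y_hat): not linked, then faked or reported correctly.
    joint_unlinked = (1.0 - h) * (eps1 if y_hat == 1 else (1.0 - eps1))
    return joint_linked / (joint_linked + joint_unlinked)


# Hypothetical values: H = 0.3, eps1 = 0.005, eps2 = 0.2.
q_observed_link = posterior_link_probability(1, 0.3, 0.005, 0.2)
q_observed_no_link = posterior_link_probability(0, 0.3, 0.005, 0.2)
print("q_uv given an observed link   :", q_observed_link)
print("q_uv given no observed link   :", q_observed_no_link)
```

Because fake links are rare in this example, an observed link is strong evidence of a true link, whereas an unobserved pair still retains a non-negligible posterior probability of being linked.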

3.2 Derivation of the M-step

In the M-step, we maximize the expected complete-data log likelihood. In other words, we want to solve the following optimization problem:

$$ \begin{aligned} &\text{maximize}~~~Q(\mathbf{X},\mathbf{X}^{\text{old}})\\ &~~~~\text{s.t.}~~~~~~~\mathbf{X}\geq\mathbf{0}, \end{aligned} $$
(21)

where the matrix inequality stands for componentwise inequalities. We note that only two terms in the objective function (20) depend on the optimization variable X. So, the problem can be equivalently written as:

$$ \begin{aligned} &\text{maximize}~~~J(\mathbf{X})\triangleq\sum_{1\leq u< v\leq N}\left[q_{uv}\log H(\mathbf{x}_{u},\mathbf{x}_{v})\right.\\ &~~~~~~~~~~~~~~~~~~~~~~~~~\left.+(1-q_{uv})\log(1-H(\mathbf{x}_{u},\mathbf{x}_{v}))\right]\\ &~~~~\text{s.t.}~~~~~~~\mathbf{X}\geq\mathbf{0}. \end{aligned} $$
(22)

The gradient of J with respect to xu is:

$$ \nabla_{\mathbf{x}_{u}}J=\sum_{v\neq u}\frac{q_{uv}-H(\mathbf{x}_{u},\mathbf{x}_{v})}{H(\mathbf{x}_{u},\mathbf{x}_{v})(1-H(\mathbf{x}_{u},\mathbf{x}_{v}))}\nabla_{\mathbf{x}_{u}}H(\mathbf{x}_{u},\mathbf{x}_{v}). $$
(23)

A projected coordinate ascent algorithm is utilized to solve the optimization problem (22). Each time, we optimize J with respect to a single vector \(\mathbf{x}_{u}\) using gradient ascent while keeping the other vectors \(\mathbf{x}_{v}\) (v≠u) fixed. After each iteration, we project the updated \(\mathbf{x}_{u}\) onto the nonnegative orthant to satisfy the nonnegativity constraint.
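One sweep of the projected coordinate ascent can be sketched as follows for the specific choice H = 1 − exp(−x_u^T x_v), for which ∇H = (1 − H) x_v and the gradient (23) simplifies to a sum of (q_uv − H)/H · x_v. This is a sketch under assumptions, not the authors' implementation: it takes a single fixed-size gradient step per node, clips H away from 0 and 1 for numerical safety, and uses hypothetical names and values.

```python
import math
from typing import List

TINY = 1e-10  # assumed numerical guard keeping H strictly inside (0, 1)


def link_prob(x_u: List[float], x_v: List[float]) -> float:
    """Assumed link probability H = 1 - exp(-x_u^T x_v), clipped to (TINY, 1 - TINY)."""
    h = 1.0 - math.exp(-sum(a * b for a, b in zip(x_u, x_v)))
    return min(max(h, TINY), 1.0 - TINY)


def objective(X: List[List[float]], Q: List[List[float]]) -> float:
    """Objective J(X) of problem (22); Q is the symmetric matrix of posteriors q_uv."""
    total = 0.0
    for u in range(len(X)):
        for v in range(u + 1, len(X)):
            h = link_prob(X[u], X[v])
            total += Q[u][v] * math.log(h) + (1.0 - Q[u][v]) * math.log(1.0 - h)
    return total


def gradient(X: List[List[float]], Q: List[List[float]], u: int) -> List[float]:
    """Gradient (23) with respect to x_u, using grad H = (1 - H) x_v for this H."""
    grad = [0.0] * len(X[u])
    for v in range(len(X)):
        if v == u:
            continue
        h = link_prob(X[u], X[v])
        weight = (Q[u][v] - h) / h  # the (1 - H) factors cancel
        for k in range(len(grad)):
            grad[k] += weight * X[v][k]
    return grad


def coordinate_ascent_sweep(
    X: List[List[float]], Q: List[List[float]], step: float = 0.01
) -> List[List[float]]:
    """Update each x_u in turn by one gradient step, then project onto x_u >= 0."""
    X = [list(row) for row in X]
    for u in range(len(X)):
        g = gradient(X, Q, u)
        X[u] = [max(x + step * g_k, 0.0) for x, g_k in zip(X[u], g)]
    return X


# Hypothetical starting point and posteriors for four nodes and two communities.
X_old = [[0.5, 0.7], [0.6, 0.4], [0.9, 0.3], [0.4, 0.8]]
Q_example = [[0.0, 0.9, 0.2, 0.6],
             [0.9, 0.0, 0.5, 0.1],
             [0.2, 0.5, 0.0, 0.7],
             [0.6, 0.1, 0.7, 0.0]]
X_new = coordinate_ascent_sweep(X_old, Q_example)
print("J before:", objective(X_old, Q_example), "J after:", objective(X_new, Q_example))
```

In practice the step size would be tuned or chosen by a line search, and the sweep would be repeated until J stops improving.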

The EM iterations are known to converge to a local maximum of the likelihood function [25]. As such, we iterate between the E-step and the M-step until convergence. After the estimate of the community affiliation strengths X is obtained, a threshold is needed to decide the hard community affiliation, i.e., whether a node belongs to a community or not. Denote by \(\mathbf{Z}\in\{0,1\}^{K\times N}\) the community affiliation matrix, whose (k,n) entry is 1 if node n belongs to community k. Denote by \(\mathbf {e}_{1}\in \mathbb {R}^{K}\) the vector with first entry equal to 1 and remaining entries equal to 0. One reasonable threshold is the solution t of the equation \(H(t\mathbf{e}_{1},t\mathbf{e}_{1})=\alpha\), where α is the background edge probability, i.e., the total number of links in the graph divided by the number of links in a complete graph with N nodes, N(N−1)/2. We judge that node u belongs to community k, i.e., z(k,u)=1, if the kth entry of \(\mathbf{x}_{u}\) is larger than t. The overall community detection algorithm is summarized in Algorithm 1. From Algorithm 1, we can see that the computational complexity of the E-step and of the M-step is \(O(N^{2})\). We note that the proposed algorithm is general in the sense that we have not yet specified the concrete form of the link probability function H(xu,xv). Several possible forms of the link probability function are listed as follows:

$$\begin{array}{@{}rcl@{}} &&H(\mathbf{x}_{u},\mathbf{x}_{v})=1-\exp\left\{-\left(\mathbf{x}_{u}^{\texttt{T}}\mathbf{x}_{v}\right)^{\theta}\right\}, ~\theta>0, \end{array} $$
(24)
$$\begin{array}{@{}rcl@{}} &&H(\mathbf{x}_{u},\mathbf{x}_{v})=1-\exp\left(-\mathbf{x}_{u}^{\texttt{T}}\mathbf{R} \mathbf{x}_{v}\right), \end{array} $$
(25)
$$\begin{array}{@{}rcl@{}} &&H(\mathbf{x}_{u},\mathbf{x}_{v})=\frac{\mathbf{x}_{u}^{\texttt{T}}\mathbf{x}_{v}}{1+\mathbf{x}_{u}^{\texttt{T}}\mathbf{x}_{v}}, \end{array} $$
(26)

where R is a symmetric and nonnegative matrix. Both (24) and (26) are intended to detect cohesive communities, where the intra-community link density is much higher than the inter-community one. The link probability function (25) can model more flexible community structures, e.g., the intra-community link density is lower than the inter-community one, or some pairs of communities link with each other more often than others. The shortcoming of (25) is that a priori knowledge about the community structure is needed to determine the structure of the matrix R.
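For the forms (24) and (26), the thresholding equation H(t e1, t e1) = α described earlier in this section has a closed-form solution, because (t e1)^T (t e1) = t². The sketch below works this out and applies the threshold to obtain hard affiliations; the function names, the example graph size, and the example strengths are hypothetical.

```python
import math
from typing import List


def background_edge_probability(num_links: int, num_nodes: int) -> float:
    """alpha: number of links divided by the N(N-1)/2 links of a complete graph."""
    return num_links / (num_nodes * (num_nodes - 1) / 2.0)


def threshold_exponential(alpha: float, theta: float = 1.0) -> float:
    """Solve 1 - exp(-(t^2)^theta) = alpha for t, i.e., form (24) at x_u = x_v = t e_1."""
    return (-math.log(1.0 - alpha)) ** (1.0 / (2.0 * theta))


def threshold_rational(alpha: float) -> float:
    """Solve t^2 / (1 + t^2) = alpha for t, i.e., form (26) at x_u = x_v = t e_1."""
    return math.sqrt(alpha / (1.0 - alpha))


def hard_affiliations(X: List[List[float]], t: float) -> List[List[int]]:
    """K x N matrix Z with Z[k][u] = 1 if the kth entry of x_u exceeds t."""
    num_communities = len(X[0])
    return [[int(x_u[k] > t) for x_u in X] for k in range(num_communities)]


# Hypothetical example: 100 nodes, 495 observed links, estimated strengths for 3 nodes.
alpha = background_edge_probability(num_links=495, num_nodes=100)
t_exp = threshold_exponential(alpha)
t_rat = threshold_rational(alpha)
Z = hard_affiliations([[1.2, 0.05], [0.1, 0.9], [0.7, 0.8]], t_exp)
print("alpha:", alpha, "t for (24) with theta=1:", t_exp, "t for (26):", t_rat)
print("Z:", Z)
```

For the form (25) the threshold depends on the first diagonal entry of R, so no single closed form is given here.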

4 Simulations and real-data experiments

In this section, synthetic data-based simulations as well as real-data-based experiments are conducted to validate the proposed community detection algorithm for the game-theoretic generative model.

4.1 Simulations

To implement simulations, we synthesize networks with N nodes and K communities according to the following procedure:

  1. Partition all nodes into K non-overlapping equal groups of nodes so that each group has N/K nodes.

  2. For each group, randomly pick ηN/K nodes outside of the group and add these nodes into the group, where 0<η<1 is a user-defined parameter.

  3. Each group is defined to be a community. Choose some community affiliation strength for nodes in the community. This strength will influence the edge density of the networks.

  4. Generate the links according to the chosen link probability function \(H(\mathbf{x}_{u},\mathbf{x}_{v})\).

  5. Add noise into the network according to the link error probabilities ε1,ε2.
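The procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' generator: every node in a community is given the same hypothetical affiliation strength `strength`, links are drawn with \(H(\mathbf{x}_{u},\mathbf{x}_{v})=1-\exp(-\mathbf{x}_{u}^{\texttt{T}}\mathbf{x}_{v})\), ε1 is taken to be the probability that a non-link is observed as a fake link, and ε2 the probability that a true link goes missing.

```python
import numpy as np

def synthesize_network(N=100, K=2, eta=0.1, strength=1.0,
                       eps1=0.005, eps2=0.3, seed=0):
    """Return (A, Z): noisy symmetric adjacency matrix and K x N membership."""
    rng = np.random.default_rng(seed)
    nodes = np.arange(N)

    # Step 1: split the nodes into K (nearly) equal, non-overlapping groups.
    groups = np.array_split(nodes, K)
    Z = np.zeros((K, N), dtype=int)
    for k, group in enumerate(groups):
        Z[k, group] = 1

    # Step 2: add eta * (group size) randomly chosen outside nodes to each group.
    for k, group in enumerate(groups):
        outside = np.setdiff1d(nodes, group)
        n_extra = int(round(eta * len(group)))
        Z[k, rng.choice(outside, size=n_extra, replace=False)] = 1

    # Step 3: assumed constant affiliation strength for every member.
    X = strength * Z.astype(float)

    # Step 4: draw each link (u, v), u < v, with probability 1 - exp(-x_u^T x_v).
    P = 1.0 - np.exp(-(X.T @ X))
    true_links = np.triu(rng.random((N, N)) < P, k=1)

    # Step 5: keep a true link with probability 1 - eps2,
    # and create a fake link on a non-link with probability eps1.
    noise = rng.random((N, N))
    observed = np.where(true_links, noise >= eps2, noise < eps1)
    observed = np.triu(observed, k=1)
    A = (observed | observed.T).astype(int)
    return A, Z
```

With N=100, K=2, and η=0.1, each community in this sketch ends up with 55 members, and the affiliation strength controls the intra-community link density through \(1-\exp(-s^{2})\) for a common strength s.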

The networks generated in this way have overlapping community structure. On average, a proportion 2η/(1+η) of the nodes in each community also belong to other communities. The parameter setup for the simulation is as follows. We set N=100, K=2,3, and η=0.1,0.2,0.3. For the link error probabilities, we select ε1=0.005 and ε2=0.1,0.2,0.3. We choose ε1 much smaller than ε2 because in practical networks, most link errors are missing links (incomplete graphs) rather than fake links. For the link probability function, we choose \(H(\mathbf {x}_{u},\mathbf {x}_{v})=1-\exp (-\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v})\) and compare the performance with that of the affiliation graph model (AGM) proposed in [15], a state-of-the-art community detection algorithm. A visualization of the community detection results of the proposed method and AGM for a synthetic network is presented in Fig. 3. There are two communities in the network, i.e., community 1 and community 2, whose detection results are shown separately. We observe that the proposed method outperforms AGM, especially in community 2, where many nodes missed by AGM (green nodes) are detected (red nodes) by the proposed approach.

Fig. 3

Synthetic network with missing link probability ε2=0.3: comparison of the two detected communities with the ground-truth by using the proposed noise-aware game-theoretic algorithm and the AGM in [15], respectively. Red nodes: belonging to the community and detected as in the community; blue nodes: not belonging to the community and detected as not in the community; green nodes: belonging to the community but detected as not in the community; black nodes: not belonging to the community but detected as in the community. There happens to be no black node in this network instance. a Detection of community 1 with AGM; b Detection of community 1 with the proposed noise-aware game-theoretic algorithm; c Detection of community 2 with AGM; d Detection of community 2 with the proposed noise-aware game-theoretic algorithm

For a detected community \(\mathcal {C}\) and a ground-truth community \(\bar {\mathcal {C}}\), the Balanced Error Rate (BER) between the two communities is defined to be:

$$ \texttt{BER}(\mathcal{C},\bar{\mathcal{C}})=\frac{1}{2}\left(\frac{|\mathcal{C}\backslash\bar{\mathcal{C}}|}{|\mathcal{C}|}+\frac{|\bar{\mathcal{C}}\backslash \mathcal{C}|}{|\bar{\mathcal{C}}|}\right). $$
(27)

For every detected community \(\mathcal {C}\), we calculate \(\min _{\bar {\mathcal {C}}}\texttt {BER}(\mathcal {C},\bar {\mathcal {C}})\). For every ground-truth community \(\bar {\mathcal {C}}\), we calculate \(\min _{\mathcal {C}}\texttt {BER}(\mathcal {C},\bar {\mathcal {C}})\). The performance metric is then the average of all these minimum BERs. The simulation results for different numbers of communities and different degrees of community overlap are shown in Fig. 4, where we compare the proposed noise-aware game-theoretic algorithm with the AGM in [15]. We find that the proposed algorithm always outperforms the AGM, and the performance enhancement increases with the noise level ε2 (except for the networks in Fig. 4f).
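The metric above can be computed with a few lines of Python. The following is a minimal sketch, assuming that communities are given as non-empty sets of node identifiers and that the final score pools the minima over detected and ground-truth communities into a single average; the function names are hypothetical.

```python
def ber(detected, truth):
    """Balanced Error Rate of Eq. (27) between two non-empty node sets."""
    detected, truth = set(detected), set(truth)
    false_members = len(detected - truth) / len(detected)
    missed_members = len(truth - detected) / len(truth)
    return 0.5 * (false_members + missed_members)

def average_min_ber(detected_list, truth_list):
    """Average of the best-match BERs over all detected and ground-truth communities."""
    scores = [min(ber(c, t) for t in truth_list) for c in detected_list]
    scores += [min(ber(c, t) for c in detected_list) for t in truth_list]
    return sum(scores) / len(scores)
```

For example, with detected community {1,2,3} and ground-truth community {2,3,4}, one of three detected nodes is wrong and one of three true nodes is missed, so the BER is (1/3+1/3)/2=1/3. A lower value indicates better agreement, and identical communities give a BER of 0.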

Fig. 4

Comparison between the proposed noise-aware game-theoretic community detection algorithm and the AGM method in [15]. a η=0.1,K=2; b η=0.2,K=2; c η=0.3,K=2; d η=0.1,K=3; e η=0.2,K=3; f η=0.3,K=3

4.2 Real-data experiments

For real-data experiments, we consider two datasets: the Facebook ego-networks dataset [26] and the DBLP collaboration network dataset [27]. Both networks have well-defined ground-truth communities. The detailed statistics about the datasets are listed as follows.

  • Facebook ego-networks: number of nodes = 4039, number of edges = 88234. Each node is a Facebook user. Two users are linked if they are Facebook friends. The ground-truth communities are identified by humans manually.

  • DBLP collaboration network: number of nodes = 317080, number of edges = 1049866. Each node is an author. Two authors are linked if they have co-authored at least one paper. The publication venue defines the ground-truth communities.

To control the size of the input network to the community detection algorithm, we sample the original network to obtain smaller subnetworks, on which we perform the community detection [15]. Specifically, we randomly select one node belonging to at least two communities, and the subnetwork consists of all nodes that share at least one community with the selected node. Furthermore, we add noise to the networks with ε1=0.005 and ε2=0.1,0.2,0.3,0.4. For the link probability function, we again choose \(H(\mathbf {x}_{u},\mathbf {x}_{v})=1-\exp (-\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v})\). Visualizations of the detection results for a Facebook ego-network and a DBLP network are shown in Figs. 5 and 6, respectively. Both networks have two communities. As with the synthetic network in Fig. 3, the proposed approach outperforms AGM, especially for community 1 in both the Facebook and DBLP networks. The relative improvement of the proposed noise-aware game-theoretic algorithm over the AGM is listed in Table 3. Again, the proposed algorithm always outperforms the AGM, and the performance improvement increases with the noise level ε2.
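The subnetwork sampling step described above can be sketched as follows. This is an illustrative sketch only: the edge-list and list-of-sets input formats, the function name, and the returned triple are assumptions made for the example rather than details given in the paper.

```python
import random

def sample_subnetwork(edges, communities, seed=0):
    """Pick a node in >= 2 communities and keep all nodes sharing a community with it.

    edges: iterable of (u, v) pairs; communities: list of sets of nodes.
    Returns (selected node, kept node set, edges among kept nodes).
    """
    rng = random.Random(seed)

    # Map every node to the indices of the communities it belongs to.
    membership = {}
    for cid, members in enumerate(communities):
        for node in members:
            membership.setdefault(node, set()).add(cid)

    # Candidate seeds are the nodes that belong to at least two communities.
    candidates = sorted(n for n, cids in membership.items() if len(cids) >= 2)
    if not candidates:
        raise ValueError("no node belongs to at least two communities")
    selected = rng.choice(candidates)

    # Keep every node that shares at least one community with the selected node.
    kept = set().union(*(communities[cid] for cid in membership[selected]))
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return selected, kept, sub_edges
```

The ground-truth communities of the subnetwork are then the communities containing the selected node, which is why each sampled subnetwork has at least two of them.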

Fig. 5

Facebook ego-network with missing link probability ε2=0.3: comparison of the two detected communities with the ground-truth by using the proposed noise-aware game-theoretic algorithm and the AGM in [15], respectively. The nodes’ colors have the same meaning as in Fig. 3. a Detection of community 1 with AGM; b Detection of community 1 with the proposed noise-aware game-theoretic algorithm; c Detection of community 2 with AGM; d Detection of community 2 with the proposed noise-aware game-theoretic algorithm

Fig. 6

DBLP network with missing link probability ε2=0.3: comparison of the two detected communities with the ground-truth by using the proposed noise-aware game-theoretic algorithm and the AGM in [15], respectively. The nodes’ colors have the same meaning as in Fig. 3. a Detection of community 1 with AGM; b Detection of community 1 with the proposed noise-aware game-theoretic algorithm; c Detection of community 2 with AGM; d Detection of community 2 with the proposed noise-aware game-theoretic algorithm

Table 3 Relative enhancement of the proposed noise-aware game-theoretic algorithm over the AGM on real-world datasets

We further investigate the impact of the selection of the link probability function \(H(\mathbf{x}_{u},\mathbf{x}_{v})\) on the performance. The performance of the function \(H(\mathbf {x}_{u},\mathbf {x}_{v})=1-\exp (-\mathbf {x}_{u}^{\texttt {T}}\mathbf {x}_{v})\) serves as a benchmark. Additionally, we select three different link probability functions and study their relative community detection accuracy improvements on the Facebook and DBLP datasets. The results are shown in Table 4, where the matrix R in the last function is the block diagonal matrix \(\mathbf{R}=\text{diag}(\mathbf{R}_{0},\mathbf{R}_{0},\ldots,\mathbf{R}_{0})\), with

$$ \mathbf{R}_{0}=\begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}. $$
(28)
Table 4 Relative improvement of the community detection accuracy with different link probability functions H(xu,xv)

We note that this link probability function is suitable for detecting community structure in which inter-community links are denser than intra-community links. The results indicate that different choices of link probability function lead to different performance, and that the performance variations depend on the dataset. Specifically, the performance degradation from using the link probability function (25) suggests that in the Facebook and DBLP datasets, the intra-community links are denser than the inter-community links. We also study the Chesapeake and Florida Bay foodweb network [28], where the inter-community links are denser than the intra-community links (there are many links between a group of predators and the corresponding prey group). Thus, we utilize the link probability function in (25), where the matrix R is set to be R0. The link density between the two detected communities, i.e., the ratio between the number of links and the number of possible links in a complete graph, is then 0.590, while that of the entire graph is only 0.223. The two detected communities are depicted in Fig. 7. We observe that there are many links between the two communities, while only a few links exist within each community. The detected community structure thus correctly characterizes the predator-prey relationship in the network.
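The quantities used in this comparison can be sketched in Python as follows. The sketch rests on several assumptions that the text does not spell out: the network is represented by a symmetric 0/1 adjacency matrix, the two node groups passed to the inter-community density are disjoint (for instance after setting aside the nodes in the intersection), and the number of communities is even when building the block diagonal R. All function names are hypothetical.

```python
import numpy as np

def block_diagonal_R(K):
    """R = diag(R0, ..., R0) with R0 = [[0, 1], [1, 0]]; assumes K is even."""
    R0 = np.array([[0.0, 1.0], [1.0, 0.0]])
    return np.kron(np.eye(K // 2), R0)

def graph_density(adj):
    """Number of links divided by the N(N-1)/2 links of a complete graph."""
    n = adj.shape[0]
    return np.triu(adj, k=1).sum() / (n * (n - 1) / 2)

def inter_group_density(adj, group_a, group_b):
    """Links between two disjoint node groups divided by |A| * |B| possible links."""
    a = np.asarray(sorted(group_a))
    b = np.asarray(sorted(group_b))
    return adj[np.ix_(a, b)].sum() / (len(a) * len(b))
```

An inter-group density well above the overall graph density, as in the 0.590 versus 0.223 reported above, is the signature of the predator-prey type of structure that the off-diagonal form of R0 is meant to capture.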

Fig. 7

The detected two communities in the Chesapeake and Florida Bay foodweb network. Blue nodes and green nodes represent two communities, respectively. The red nodes correspond to the intersection of the two communities

5 Conclusion

We have presented a game-theoretic analysis of the community detection problem in both noiseless and noisy networks, which takes nodes’ rational decision making into account. The equilibria of the formulated game lead to a probabilistic generative model of networks with community structure. Based on this game-theoretic model, we have proposed a general community detection algorithm using the EM algorithm. The effectiveness of the proposed algorithm is validated by simulations as well as real-data experiments. We hope that this paper opens a new direction for studying the community detection problem from a microeconomic perspective.