Introduction

Many real-world systems can be modeled by a network G=(V,E) of interacting actors. The actors are represented by a set of nodes V with cardinality |V|, connected via a set of edges (links) E with cardinality |E|. The edges can be directed or undirected, depending on the type of interaction. Well-known examples of such real-world systems include technological and transportation infrastructures, communication systems, biological networks, and social interactions. Centrality measures quantify the importance of a node within a given network. Some notions of centrality consider only local properties of the network, while others reflect global properties; the proper quantification of importance depends on the application context. For applications in which a node's reachability to the entire network matters, researchers have introduced the closeness centrality measure. For an arbitrary node u, its closeness centrality C(u) is defined as the inverse of its average distance to the other nodes in the network. More formally:

$$\begin{array}{@{}rcl@{}} C(u)=\frac{|V|-1}{{\sum}_{v \in V \setminus \{u\}} d(u,v)} \end{array} $$
(1)

where d(u,v) is the shortest-path distance between u and v. Locating public facilities over a transportation network so that they are easily accessible to everyone, or identifying people whose position in a social network is ideal for information dissemination or network influence, are scenarios in which identifying nodes with high closeness centrality is of great interest (Saxena et al. 2017; Taheri et al. 2017a, b). In these scenarios, we are mainly interested in efficiently and accurately detecting the top-k high closeness centrality nodes, while their exact relative order and their actual closeness centrality values are of little importance.

A trivial approach to identifying the top-k closeness centrality nodes consists of the following steps: (1) using breadth-first search (BFS) to calculate the closeness centrality of each node in O(|V|+|E|), for a total computational cost of O(|V||E|+|V|^2); (2) sorting the computed values in O(|V| log(|V|)) and reporting the top-k nodes. The high computational cost of O(|V||E|+|V|^2) and the requirement of full knowledge of the network topology prevent such a method from being applied to large real-world networks (Wehmuth and Ziviani 2012). To address this issue, it is important to develop scalable distributed algorithms in which each node interacts only with its immediate neighbors (You et al. 2017).
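For concreteness, the following minimal Python sketch implements this baseline (a sketch only, assuming a connected network; networkx and the karate-club graph are used purely for illustration):

```python
from collections import deque

import networkx as nx


def closeness(G, u):
    """Closeness centrality of node u (Eq. (1)) via a single BFS in O(|V|+|E|)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in G.neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return (len(G) - 1) / sum(dist.values())


def top_k_closeness(G, k):
    """Exact top-k: one BFS per node plus a sort, O(|V||E| + |V|^2) overall."""
    scores = {u: closeness(G, u) for u in G}
    return sorted(scores, key=scores.get, reverse=True)[:k]


G = nx.karate_club_graph()  # small connected example network
print(top_k_closeness(G, k=5))
```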

To the best of our knowledge, there is no distributed and decentralized algorithm for detecting the top-k high closeness centrality nodes that requires each node only to interact locally with its immediate neighbors. However, several algorithms satisfy these properties and compute the exact or approximate closeness centrality of each node in the network. Approximation approaches compute an alternative centrality score that correlates highly with the global closeness centrality. An efficient sorting algorithm can then be applied on top of these methods to identify the top-k high closeness centrality nodes. Such approaches have two major shortcomings: (1) they do not exploit the fact that the vector of closeness centrality values has a few (k) large coefficients and many small ones, so that it can be well approximated by a k-sparse vector (signal); in general, a centrality measure (e.g. closeness centrality) must have a right-skewed probability distribution to be useful for selecting important nodes. (2) They require a direct measurement (query) from each node, which is not always possible due to log-in requirements, API query limits, and the treatment of user data as proprietary.

To address these issues, we transform the problem of detecting the top-k closeness-central nodes into a sparse recovery problem over networks. The breakthrough approach to sparse recovery is compressive sensing (aka compressive sampling), which performs a few indirect end-to-end measurements on a signal x and recovers a good sparse approximation of that signal. However, additional requirements must be taken into account when these measurements are performed over a graph rather than on an arbitrary signal. Creating feasible measurements that satisfy these constraints (discussed in the “Compressive Sensing over Networks” section) has initiated the field of compressive sampling over graphs.

Our contributions in this paper are two-fold: (1) We propose a local (ego-centric) metric that can be computed in a distributed manner at each node, requiring only local knowledge of the node's immediate neighborhood. In the “Experimental Evaluation” section, we experimentally show that the suggested local metric is highly correlated with the global closeness centrality on many real-world and synthetic networks. (2) We propose a general compressive sensing framework for the distributed identification of central nodes in networks, based on the introduced local metric and using indirect end-to-end (aggregated) measurements. We experimentally show that our approach is more accurate at predicting high closeness-central nodes than the best existing competing methods.

The rest of this paper is organized as follows. In the “Preliminaries” section, we briefly explain the preliminary notation and definitions. In the “Related Work” section, we review related work on the distributed detection of central nodes in which each node interacts only with its neighbors. In the “Proposed Method” section, we introduce our novel approach in detail and analyze its time and space complexity. In the “Experimental Evaluation” section, the settings and results of our experimental evaluations are presented. We conclude the paper in the “Conclusion” section.

A preliminary version of this paper appeared in (Mahyar et al. 2018). Here, we explain the background and the intuition behind the idea in more detail. We also comprehensively review the related work and describe its limitations together with our corresponding solutions. Moreover, we add three different types of real datasets and several test scenarios to our extensive experimental evaluations to show the generality of the proposed method.

Preliminaries

Compressive Sampling

As an alternative to direct measurements, one can utilize sampling-based approaches. By the Nyquist-Shannon theorem, a general signal x can be completely recovered by sampling it at the Nyquist rate. However, sampling at the Nyquist rate can be costly or impossible at the massive scale of many real-world networks we face today. If the underlying signal is sparse in a suitable basis, sampling at the Nyquist rate only to recover a relatively small fraction of non-zero elements wastes system resources and induces two sources of error: sampling (collection) error and identification (compression) error.

The state-of-the-art approach for the recovery of sparse signals is Compressive Sensing/Sampling (CS), which addresses these drawbacks. In compressive sampling, one simultaneously samples and compresses a signal \(x_{n\times 1}\) through a measurement matrix \(\mathcal {A}_{m\times n}\), where m≪n, to acquire the following linear system:

$$\begin{array}{@{}rcl@{}} y_{m\times 1}=\mathcal{A}_{m\times n}~x_{n\times 1} \end{array} $$
(2)

The resulting system is under-determined and, in general, does not have a unique solution. \(\mathcal {A}\) is said to satisfy the 2k-restricted isometry property (RIP) if there exists \(0<\delta_{2k}<1\) such that for all 2k-sparse signals x′:

$$\begin{array}{@{}rcl@{}} (1-\delta_{2k})||x^{\prime}||_{2} \leq ||\mathcal{A}x^{\prime}||_{2} \leq (1+\delta_{2k})||x^{\prime}||_{2} \end{array} $$
(3)

If the measurement matrix satisfies the 2k-RIP, one can prove the uniqueness of a k-sparse solution to the above linear system (\(y=\mathcal {A}x\)). To see this, assume x1 and x2 are both k-sparse signals with \(\mathcal {A}x_{1}=\mathcal {A}x_{2}\); then the vector x′=x1−x2 is a 2k-sparse signal (it has at most 2k non-zero entries). Since \(\mathcal {A}\) satisfies the 2k-RIP, Eq. (3) can be rewritten for some \(0<\delta ^{\prime }_{2k}<1\), which ensures x1=x2, as:

$$\begin{array}{@{}rcl@{}} (1-\delta^{\prime}_{2k})||x_{1}-x_{2}||_{2} \leq 0 \leq (1+\delta^{\prime}_{2k})||x_{1}-x_{2}||_{2} \end{array} $$
(4)

Let x be an arbitrary k-sparse vector, and let \(\mathcal {A}\) be an arbitrary measurement matrix that satisfies the 2k-RIP. Given the discussion so far, it is easy to see that x can be recovered by solving:

$$\begin{array}{@{}rcl@{}} \min_{x} \Vert x \Vert_{0} ~~~\text{s.t.}~~~ y = \mathcal{A} x \end{array} $$
(5)

where \(\Vert x \Vert_{0}\) indicates the number of non-zero entries in x. Unfortunately, solving this optimization problem is NP-hard. Thus, the following relaxation, which utilizes the sparsity-inducing ℓ1-norm and is referred to as Basis Pursuit (BP), is considered instead:

$$\begin{array}{@{}rcl@{}} \min_{x} \Vert x \Vert_{1} ~~~\text{s.t.}~~~ y = \mathcal{A} x \end{array} $$
(6)

It has been shown that when the 2k-restricted isometry property is satisfied for \(\mathcal {A}\), the solution of BP is exactly x. In this case, thanks to the convexity of BP, recovery is efficient and computationally fast. Note that the strict constraint \(y=\mathcal {A}x\) in the Basis Pursuit formulation is very sensitive to imperfect sparsity or noise. The following formulation, known as LASSO, addresses this by removing the exact constraint and penalizing its violation:

$$\begin{array}{@{}rcl@{}} \min_{x} \Vert x \Vert_{1} + \Vert \mathcal{A} x - y \Vert_{2}^{2} \end{array} $$
(7)

This objective has extremely fast distributed numerical solvers and will be utilized for the optimization step in this paper.
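To illustrate this recovery pipeline end to end, the following sketch recovers a synthetic k-sparse non-negative signal from m≪n random Gaussian measurements with scikit-learn's Lasso solver (the sizes and the regularization weight are illustrative assumptions; the paper itself uses the POGS solver described in the “Experimental Evaluation” section):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                          # signal length, measurements, sparsity

x = np.zeros(n)                               # k-sparse, non-negative ground truth
x[rng.choice(n, size=k, replace=False)] = rng.uniform(1, 10, size=k)

A = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian matrices satisfy RIP w.h.p.
y = A @ x                                     # m indirect linear measurements

# sklearn's Lasso minimizes ||Ax - y||_2^2 / (2m) + alpha * ||x||_1, i.e. Eq. (7)
# up to scaling; positive=True encodes the non-negativity of the signal.
lasso = Lasso(alpha=1e-3, positive=True, max_iter=100_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("true support:     ", sorted(np.flatnonzero(x)))
print("top-k of estimate:", sorted(np.argsort(x_hat)[-k:]))
```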

Compressive Sensing over Networks

When the signal to be recovered is defined over a graph (network), three additional constraints must be taken into account in CS problems (Xu et al. 2011; Mahyar et al. 2013a): (1) each element \(\mathcal {A}_{i,j}\) is 1 if node j is visited by measurement i and 0 otherwise; (2) the nodes visited by a measurement must correspond to a connected induced sub-graph (Ghalebi et al. 2017; Mahyar et al. 2015b, 2018a, 2017); (3) the signal x, which contains a graph property defined at each node, is almost always non-negative (x≥0).

Based on the compressive sensing framework, we would like to efficiently recover the k highest closeness centrality nodes from m indirect end-to-end measurements, with m≪n. In the linear system \(y_{m \times 1} = \mathcal {A}_{m \times n} ~x_{n \times 1}\), let \(\mathcal {A}\) be an m×n measurement matrix whose i-th row corresponds to the i-th feasible measurement. For i=1,…,m and j=1,…,n, \(\mathcal {A}_{ij} = 1\) if and only if node j is visited by the i-th measurement; otherwise \(\mathcal {A}_{ij} = 0\). Let x be an n×1 non-negative vector whose j-th entry is the value of a certain network characteristic (e.g. a global/local centrality metric) at node j∈V, and let \(y \in \mathcal {R}^{m}\) denote the measurement vector whose i-th entry is the additive aggregate of the values of the nodes in the i-th row of \(\mathcal {A}\), which induces a connected sub-graph over G. Note that this way of constructing measurements already satisfies the topological feasibility constraints mentioned at the beginning of this section.

For the example network shown in Fig. 1, with n=10 nodes and |E|=11 links, each of the two measurements m1 and m2 includes a different subset of connected nodes. The corresponding feasible measurement matrix \(\mathcal {A}\) is:

$$ \begin{aligned} &\qquad\;\; v_{1} \;\;\; v_{2} \;\;\; v_{3} \;\;\; v_{4} \;\;\; v_{5} \;\;\; v_{6} \;\;\; v_{7} \;\;\; v_{8} \;\;\; v_{9} \;\;\; v_{10} \\ \mathcal{A} = & \begin{array}{c} m_{1}\\ m_{2} \end{array} \left(\begin{array}{cccccccccc} 1 & 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \end{array}\right) \end{aligned} $$
(8)
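To make the measurement semantics concrete, the following sketch encodes the two rows of Eq. (8) and computes \(y = \mathcal {A} x\) for a hypothetical node-value vector x:

```python
import numpy as np

# Rows m1, m2 of Eq. (8): entry (i, j) is 1 iff node v_{j+1} is visited by measurement i.
A = np.array([
    [1, 1, 1, 0, 0, 1, 1, 1, 0, 0],  # m1 visits {v1, v2, v3, v6, v7, v8}
    [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],  # m2 visits {v3, v4, v5, v9, v10}
])

x = np.arange(1, 11, dtype=float)    # hypothetical per-node values: x_j = j
y = A @ x                            # each y_i additively aggregates the visited nodes
print(y)                             # [27. 31.]: y1 = 1+2+3+6+7+8, y2 = 3+4+5+9+10
```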
Fig. 1

A network with 10 nodes and 11 links. The measurements m1 and m2 are feasible considering the network topological constraints (each of them induces a connected sub-graph over the network)

To understand how the additive aggregation over connected induced sub-graphs is motivated in practice, we mention an example from (Wang et al. 2012). Consider a network where the nodes represent sensors and the links represent communications between sensors. For the set T of active nodes within an arbitrary feasible measurement that induces a connected sub-graph, a node u∈T monitors the total value corresponding to the nodes in T. Every node in T obtains values from its children (if any) on the spanning tree rooted at u, aggregates them with its own value, and sends the sum to its parent. The fusion center can then obtain the sum of the values of all nodes in T by communicating only with u. This data acquisition and aggregation paradigm is widely utilized in the wireless sensor network literature for applications such as air quality monitoring, volcanic activity detection, and object localization (Middya et al. 2017). Some recent work has applied a similar paradigm to network tomography (Mahyar et al. 2013a), community detection (Mahyar et al. 2015b), and finding key actors in social networks (Mahyar 2015; Mahyar et al. 2015a; Grosu et al. 2018).
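A minimal sketch of this tree-based aggregation, with a hypothetical spanning tree of T and hypothetical node values:

```python
def tree_aggregate(children, value, root):
    """Sum the values over T via the spanning tree rooted at `root`: each node adds
    its children's partial sums to its own value and forwards the result upward."""
    return value[root] + sum(tree_aggregate(children, value, c)
                             for c in children.get(root, ()))


children = {'u': ['a', 'b'], 'a': ['c']}     # spanning tree of T rooted at u
value = {'u': 1.0, 'a': 2.0, 'b': 0.5, 'c': 3.0}
print(tree_aggregate(children, value, 'u'))  # 6.5: the fusion center reads one number
```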

Based on this idea, a straightforward approach used in practice to construct measurement matrices satisfying these properties is to create a correspondence between each measurement and a random walk on the graph. Each random walk additively aggregates the values computed by the nodes during the walk. The random-walk strategy and the values computed by the nodes are what distinguish one method from another. The performance of these methods and their RIP satisfaction can then be verified theoretically or experimentally (Mahyar et al. 2018; Mahyar 2015; Mahyar et al. 2018a; Xu et al. 2011). An alternative approach (Mahyar et al. 2018b) employs a well-known randomized construction from the compressive sensing literature that satisfies the restricted isometry property with very high probability, making theoretical recovery guarantees straightforward to derive. It can also be shown that each constructed measurement almost surely corresponds to an induced connected sub-graph.

Related Work

In this section, we first review local metrics that highly correlate with the global closeness centrality and can be computed in a distributed manner relying only on interactions of neighboring nodes. After that, we review compressive sensing (CS)-based methods that can be utilized to recover top-k central nodes, using the mentioned local metrics by constructing a feasible measurement matrix.

Local Closeness Metrics

Dist-Exact (You et al. 2017): The authors proposed a distributed method to compute, at an arbitrary node u, the set of nodes at exact distance h from u, with h ranging from 1 to \(\mathcal {D}\), where \(\mathcal {D}\) denotes the diameter of the network. The collected sets can then be used to compute the closeness centrality at each node.

Dist-Est (Wang and Tang 2015): The authors derived a set of affine constraints that are distributed in nature and characterize closeness centrality according to its original definition. The derived constraints are used to develop an algorithm that enables the nodes of a network to cooperatively estimate their closeness centralities.

DACCER (Wehmuth and Ziviani 2012): Let volh(u) denote the sum of the degrees of all nodes in the h-hop neighborhood of u. The authors showed a high correlation between volh(u), ∀u ∈ V, and the closeness centrality distribution for h>0; the correlation becomes stronger as h grows.

Weight-Vol (Kim and Yoneki 2012): This work extends the DACCER metric based on two simple observations: first, nodes closer to a given node contribute more than farther nodes to the dissemination of that node's information; second, nodes with low clustering coefficients tend to be hubs linking neighboring parts of the network.

CS-based Methods for Data Aggregation

RW (Xu et al. 2011): This is one of the state-of-the-art methods in compressive sensing over graphs; it constructs random-walk-based measurements. Each measurement in the measurement matrix can be used to additively aggregate a metric of choice.

TopCent (Mahyar 2015): This method constructs a measurement matrix to recover the top-k degree-central nodes in networks. Since degree centrality is highly correlated with closeness centrality in some real-world networks, this method can be expected to perform well for detecting high closeness centrality nodes as well.

DICeNod (Mahyar et al. 2018b): This approach does not perform walks to create a measurement matrix; instead, it utilizes a well-known randomized matrix construction technique from the compressive sensing literature. The authors showed that the constructed measurements correspond, with high probability, to induced connected sub-graphs of the network.

Proposed Method

In this section, we introduce the proposed framework in the following steps: (1) defining a new ego-centric centrality measure; (2) introducing a subroutine, CS-HICLOSE-SCORECOMPUTE, which calculates the proposed ego-centric centrality metric in a distributed and decentralized manner; (3) introducing a subroutine, CS-HICLOSE-AGGREGATE, which aggregates the local scores via decentralized measurement construction in compressive sensing and is executed only after the previous subroutine has completed; and (4) analyzing the overall time and space complexity of the proposed approach. The pseudo-code of the proposed approach, CS-HICLOSE, is given in Algorithm 1, which mainly calls the two subroutines mentioned in steps (2) and (3).

Proposed Local Metric

We introduce the h-hop ego-centric (local) closeness centrality of node v as:

$$ egoC_{h}(v)=\sum\limits_{\tau=1}^{h} |B_{\tau}(v)|/\tau $$
(9)

where Bτ(v) denotes the set of nodes at exact shortest-path distance τ from node v. The intuition behind this metric is that nodes farther from v have less effect on the dissemination of goods (e.g. information) emerging from it.

Score Computation Subroutine

The sets Bτ(v) for τ≤h, ∀v ∈ V, can be computed by executing a breadth-first search (BFS) at each node in parallel, with an exploration radius of h. This requires a computational cost of at most O(Δ^h), where Δ is the maximum degree of the network; the required memory at each node is also O(Δ^h). The computed sets can then be used to evaluate the ego-closeness centrality at each node in a distributed and decentralized manner, with O(1) computational and storage cost per node. The ego-closeness computation thus consists of the following two steps (a short sketch follows the list):

(i) For each node v ∈ V in the network, run BFSh(v) to compute |Bi(v)|, the number of nodes at exact distance i from v, for i ranging from 1 to h. This step can be executed in a decentralized manner for each node, independently of the others.

(ii) Once |Bi(v)| is available at each node v ∈ V for i ranging from 1 to h, the ego-closeness centrality follows directly from Eq. (9). This step can also be executed in a decentralized fashion for each node independently. The pseudo-code for this subroutine is given in Algorithm 2.
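A minimal serial simulation of this subroutine follows (the adjacency structure adj, a dict of neighbor sets, is an assumed representation; in the actual decentralized setting each node runs its own depth-limited BFS):

```python
from collections import deque


def ego_closeness(adj, v, h):
    """egoC_h(v) = sum_{tau=1..h} |B_tau(v)| / tau (Eq. (9)) via a BFS of radius h."""
    dist = {v: 0}
    ring_size = [0] * (h + 1)  # ring_size[tau] = |B_tau(v)|
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue           # do not explore beyond radius h
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                ring_size[dist[w]] += 1
                queue.append(w)
    return sum(ring_size[tau] / tau for tau in range(1, h + 1))


# Each node would run this independently; here we simulate that with a loop.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print({v: ego_closeness(adj, v, h=2) for v in adj})
```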

Score Aggregation Subroutine

The proposed compressive sensing-based method for aggregating the computed ego-centric metric is depicted in Algorithm 3 and consists of six steps (a sketch of one measurement follows the list):

(i) The first node vfirst is added to the visited set S, and all of its neighbors are added to the neighbor set \(\mathcal {N}(S)\).

(ii) The next node vnext is selected from the nodes in \(\mathcal {N}(S)\) with probability proportional to egoCh(vnext), which was already computed by the previous subroutine.

(iii) The selected node is added to the visited set S and removed from the neighbor set \(\mathcal {N}(S)\); its neighbors are then added to \(\mathcal {N}(S)\).

(iv) Steps (ii)–(iii) are repeated l times, where l is the measurement length, to generate a new row of the matrix \(\mathcal {A}\) and the corresponding entry of the vector y.

(v) Step (iv) is repeated m times (in parallel) to construct a feasible measurement matrix \(\mathcal {A}\) with m measurements and the corresponding measurement vector y.

(vi) To find the sparse approximation \(\hat {x}\) of x, we optimize the LASSO objective function subject to the linear sketch \(y = \mathcal {A} x\), based on Eq. (7).
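The following sketch illustrates the construction of a single measurement under the above description (a sketch only: the weighted choice of vnext is implemented with random.choices rather than the cumulative-sum binary search analyzed below, and ego is the dictionary of precomputed egoCh values):

```python
import random


def build_measurement(adj, ego, l):
    """Grow a connected visited set S of size l, choosing each next node from N(S)
    with probability proportional to its ego-closeness; return the indicator row
    of A and the aggregated value y_i = sum of ego-closeness over S."""
    v_first = random.choice(list(adj))
    S = {v_first}
    frontier = set(adj[v_first])                 # N(S)
    y_i = ego[v_first]
    while len(S) < l and frontier:
        cand = list(frontier)
        v_next = random.choices(cand, weights=[ego[v] for v in cand])[0]
        frontier.discard(v_next)
        S.add(v_next)
        frontier |= set(adj[v_next]) - S         # add unseen neighbors of v_next
        y_i += ego[v_next]                       # additive aggregation along the walk
    row = [1 if v in S else 0 for v in sorted(adj)]
    return row, y_i
```

Repeating this m times (in parallel) yields \(\mathcal {A}\) and y, after which the LASSO step of Eq. (7) produces \(\hat {x}\).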

In this algorithm, we have m parallel aggregation processes, each started from a node selected uniformly at random from V. The random seeds for choosing the starting points of the aggregation processes can be fixed in O(m) time. The measurement corresponding to a process keeps track of two sets, S and \(\mathcal {N}(S)\). The set S is initialized with vfirst, and \(\mathcal {N}(S)\) is initialized with vfirst's immediate neighbors, denoted by \(\mathcal {N}(v_{first})\). In each of l sequential iterations, a candidate vnext is selected from \(\mathcal {N}(S)\) with probability proportional to egoCh(vnext), removed from \(\mathcal {N}(S)\), and added to S; the neighbors of vnext not already present in \(\mathcal {N}(S)\) are then added to \(\mathcal {N}(S)\). In other words, S is the set of visited nodes and \(\mathcal {N}(S)\) is the set of candidate nodes that are not in S but are connected to some node(s) in S. This ensures that the set of visited nodes S at every iteration corresponds to an induced connected sub-graph of the network. At iteration i of the l iterations, the maximum size of \(\mathcal {N}(S)\) is min(iΔ−i,|V|); thus, selecting a member of \(\mathcal {N}(S)\) proportionally to the ego-closeness centralities using a binary search costs O(log(min(iΔ−i,|V|))). The total cost of this binary search method is O(|V| log(|V|)). To show this, we consider two cases. If \(l \le \left \lfloor {\frac {|V|}{{\Delta - 1}}} \right \rfloor \), then:

$$\begin{array}{*{20}l} \sum\limits_{i=1}^{l} \log\left(\min(i\Delta-i,|V|)\right) =&\sum\limits_{i = 1}^{l} {\log \left(i \Delta - i \right)}= \log (l !) + l \log (\Delta - 1) \\ \le& ~l\log(l) + l\log(\Delta)\le 2 |V| \log(|V|) \end{array} $$

Otherwise, if \(l > \left \lfloor {\frac {|V|}{{\Delta - 1}}} \right \rfloor \), then, using l≤|V| to bound the number of remaining terms:

$$\begin{array}{*{20}l} &\sum\limits_{i=1}^{l} \log\left(\min(i\Delta-i,|V|)\right) \le \sum\limits_{i = 1}^{\left\lfloor {\frac{|V|}{{\Delta - 1}}}\right\rfloor}{\log(i\Delta- i)} + \left(|V|-\left\lfloor{\frac{|V|}{{\Delta-1}}}\right\rfloor\right)\log(|V|) \\ = & ~\log\left(\left\lfloor {\frac{|V|}{{\Delta - 1}}} \right\rfloor !\right)+ \left\lfloor {\frac{|V|}{{\Delta - 1}}} \right\rfloor \log (\Delta - 1) + \left(|V| - \left\lfloor {\frac{|V|}{{\Delta - 1}}} \right\rfloor\right) \log (|V|) \\ \le & ~\left\lfloor {\frac{|V|}{{\Delta - 1}}} \right\rfloor \left(\log \Big(\frac{|V|}{{\Delta - 1}} \Big) + \log \big(\Delta - 1\big)\right) +\left(|V| - \left\lfloor {\frac{|V|}{{\Delta - 1}}} \right\rfloor\right) \log (|V|) \\ = & ~|V| \log (|V|) \end{array} $$

Moreover, the total number of deletions from and additions to \(\mathcal {N}(S)\) is at most |V|, and each addition/deletion can be done in O(1) using an array structure. Thus, the total time complexity of the aggregation stage is O(m+|V| log(|V|)+|V|)=O(|V| log(|V|)), assuming m≪|V| aggregation processes (measurements). The required space for each aggregation process is O(l) for the visited nodes and O(1) for their aggregated value; a space of at most O(|V|) is required for keeping track of the lists S and \(\mathcal {N}(S)\). Finally, global storage of size O(m) is needed for the initial measurement seeds.

Complexity Analysis of CS-HICLOSE

Overall, our approach requires a running time of O(|V| log(|V|)+Δ^h), local storage of O(Δ^h) at each node, and global storage of size O(m) for the seeds. In addition, local storage of O(l) is required for each aggregation process (measurement). In the next section, we show a high correlation between the proposed ego-centric centrality with h=2 and the global closeness centrality of the nodes in various networks. The experiments indicate that increasing h beyond two does not gain much additional correlation, although it incurs, in the worst case, a factor of Δ higher computational and storage cost. Thus, we suggest h=2 for a satisfactory yet efficient utilization of our algorithm. It is worth noting that in most real-world networks, in particular social networks, nodes are connected to a tiny portion of the whole network, which means Δ (and in turn Δ^2) is very small. For example, the maximum number of connections allowed on Twitter and Facebook is about 5000, much smaller than their network sizes (Mahyar et al. 2018a). This shows that our approach is practically efficient and scalable on real-world networks.

Experimental Evaluation

In this section, we experimentally evaluate the performance of the proposed method in various scenarios over both synthetic and real-world networks. We first introduce the networks used for the evaluation. Then, we explain the settings of the experiments. Finally, the achieved results for each test scenario and their analyses are presented.

Datasets

For the evaluation of the proposed method, we considered both synthetic and real networks. The properties of the real-world networks used in the experiments are summarized in Table 1. The four notations 〈deg〉, \(\langle \mathcal {C}\rangle \), \(\mathcal {D}\), and δ0.9 denote the average degree, average clustering coefficient, network diameter, and 90-percentile effective diameter, respectively. In the case of a disconnected network, we extracted the largest (strongly) connected component.

Table 1 Real-world networks

We also considered three well-known models, Barabási-Albert (BA), Erdős-Rényi (ER), and Watts-Strogatz (SW), for generating synthetic networks; these are summarized in Table 2. In the ER network, the link existence probability p=0.01 ensures that the generated network is connected, as \(p > \frac {\ln |V|}{|V|}\) is a sharp threshold for the connectedness of ER networks with |V| vertices.

Table 2 Synthetic network models
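For reference, all three generators are available in networkx; the size and parameters below are illustrative placeholders (Table 2 lists the values actually used):

```python
import networkx as nx

n = 1000                                       # hypothetical network size
ba = nx.barabasi_albert_graph(n, m=3)          # BA: preferential attachment
er = nx.erdos_renyi_graph(n, p=0.01)           # ER: p > ln(n)/n, connected w.h.p.
sw = nx.watts_strogatz_graph(n, k=10, p=0.1)   # SW: ring lattice with random rewiring
```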

Settings

To evaluate the accuracy of the proposed method (CS-HICLOSE) against the competing methods in identifying the top-k closeness centrality nodes, we measured the precision and recall of the algorithms. Precision is the number of correctly detected top-k closeness centrality nodes divided by the total number of detected nodes. Recall is the number of correctly detected nodes divided by the number of nodes in the ground-truth top-k list. Both the relevancy of the detected nodes (precision) and the portion of relevant nodes that are detected (recall) are important. To take both into account, we utilized the popular F-measure, the harmonic mean of precision and recall, defined as:

$$\begin{array}{@{}rcl@{}} \text{F-measure} = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{array} $$
(10)
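A minimal sketch of this evaluation protocol, comparing a detected top-k list against the ground-truth top-k set:

```python
def f_measure(detected, true_top_k):
    """Harmonic mean of precision and recall (Eq. (10)) for top-k detection."""
    hits = len(set(detected) & set(true_top_k))
    precision = hits / len(detected)   # relevancy of the detected nodes
    recall = hits / len(true_top_k)    # portion of relevant nodes detected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f_measure(detected=[1, 2, 3, 4], true_top_k=[2, 3, 5, 7]))  # 0.5
```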

Since CS-HICLOSE, RW, TopCent, and DICeNod each have a source of randomness, every experiment was repeated ten times to reduce variance. The points in the figures represent the mean over these repetitions, along with asymmetric standard deviations quantifying the variation of the F-measure at each point. Implementation code in Python can be found at https://github.com/hamidreza-mahyar/CS-HiClose . We used POGS (POGS 2018), a fast and parallel optimization solver, for the optimization phase of CS-HICLOSE. POGS minimizes the LASSO objective (Eq. (7)) and is extremely fast thanks to leveraging GPUs; for example, it can solve the LASSO objective on a graph of 100,000 nodes with 10,000 measurements in only 21 s on a single Nvidia K40 GPU (Parikh and Boyd 2014). For the computation of the global closeness centrality in Eq. (1), we used the tools available in the Python-iGraph package.

Evaluation Results

Correlation between Our ego-Closeness and the Global Closeness

We experimentally analyzed the correlation between the proposed ego-centric (local) centrality metric and the global closeness centrality over several synthetic and real-world networks. To compare the two centrality metrics, we used the Pearson product-moment correlation coefficient (ρ), which measures the strength of the linear association between two variables and is defined as (Benesty et al. 2009):

$$\begin{array}{@{}rcl@{}} \rho = \frac{{\sum}_{i=1}^{|V|} (x_{i}-\overline{x})(y_{i}-\overline{y}) }{\sqrt{{\sum}_{i=1}^{|V|} (x_{i}-\overline{x})^{2} {\sum}_{i=1}^{|V|} (y_{i}-\overline{y})^{2}}} \end{array} $$
(11)

where |V| is the number of network nodes, and xi, yi are the local and global centrality measures of node i, respectively; \(\overline {x}\) and \(\overline {y}\) are their means. The Pearson coefficient ρ takes values in the range [−1,+1]: a value of 0 indicates no association, a value greater than 0 a positive association, and a value less than 0 a negative association.
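In practice, ρ is a one-liner; a small sketch with numpy on hypothetical score vectors:

```python
import numpy as np

local = np.array([2.5, 2.5, 3.0, 2.0])         # hypothetical egoC_h scores
global_c = np.array([0.60, 0.62, 0.75, 0.50])  # hypothetical closeness scores
rho = np.corrcoef(local, global_c)[0, 1]       # Pearson rho of Eq. (11)
print(round(rho, 3))
```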

Table 3 reports the correlation coefficients between the proposed ego-closeness and the global closeness centrality. As mentioned in the “Proposed Local Metric” section, the computational and storage cost of the ego-closeness centrality is directly impacted by the choice of h. Thus, we aim to obtain good results in distributively assessing the top-k network centralities with a small value of h. We calculated the correlation for various sparsity levels k and small values of h (i.e. 2 and 3) on different networks. Note that for h=1, any local metric reduces to the degree centrality. Overall, the results show that our proposed local metric and the global closeness centrality correlate highly on various types of networks. Since the problem addressed in this paper is to identify the top-k central nodes for k≪|V|, the results show that choosing h=2 is sufficient yet efficient, offering a good trade-off between computational complexity and accuracy.

Table 3 Pearson correlation coefficients between the proposed ego-centric closeness centrality with small exploration radius (i.e. h={2,3}) and the global closeness centrality on synthetic and real-world networks

Table 4 shows the Pearson correlation coefficients between the existing local metrics reviewed in the “Local Closeness Metrics” section (i.e. Dist-Exact, DACCER, and Weight-Vol), as well as our proposed ego-centric centrality measure, all with h=2, and the global closeness centrality on synthetic and real-world networks. In this experiment, we mainly focus on high sparsity levels k={0.1|V|,0.2|V|,0.3|V|,0.4|V|}. After implementing Dist-Est (Wang and Tang 2015), we found that the computed values critically depend on the parameter initialization (e.g. each node must start with an estimate of its own closeness value, which is an unrealistic assumption). Moreover, this metric needs a huge number of message-passing iterations to converge. For a fair comparison, we allotted it the same number of iterations as our metric, but its correlation coefficients were around 0, so its results were excluded.

Table 4 Pearson correlation coefficients between the existing local metrics (Dist-Exact, DACCER, Weighted-Vol, and our proposed ego-centric centrality) with h=2 and the global closeness centrality on synthetic and real-world networks for varying percentage of sparsity

The results show that Dist-Exact with h=2 has a linear correlation but a negative association with the closeness centrality on networks with various sparsity levels. One can observe that our proposed metric almost always has the best correlation coefficient among the compared metrics. Another interesting observation in Tables 3 and 4 is that our ego-centric metric has a lower correlation coefficient with the global closeness centrality on the networks with a small average degree relative to their size (i.e. ca-CondMat, ca-HepTh, and DBLP).

For further analysis of the correlation between the proposed ego-centric (local) metric and the global closeness centrality, Fig. 2 shows scatter plots of the nodes' ranks according to one metric versus the other, on various networks. Each point corresponds to one node's pair of ranks. Based on the results of the previous test cases, we calculated our local measure with h=2 for low computational complexity yet high accuracy. One can easily observe the linear correlation and positive association (as the rank with respect to the local metric increases, so does the rank with respect to the global metric), especially for the top-k nodes' ranks, which are the target of this paper. As in Tables 3 and 4, our metric has a relatively lower correlation with the global closeness centrality on the ca-CondMat, ca-HepTh, and DBLP networks; these networks share the property that their average degree is low relative to their network size.

Fig. 2

Correlations between the nodes’ ranks provided by the proposed local metric and the global closeness centrality on synthetic and real-world networks. These two metrics correlate very well

Although the Pearson product-moment correlation coefficient is the most common, and almost exclusively used, measure in correlation studies of centrality indices, it does not adequately capture non-linear dependencies. Moreover, assuming a purely linear correlation between the two scores is a very strong and possibly unrealistic assumption. A common workaround for depicting some of the existing non-linear dependencies is to apply the Pearson correlation to the logarithms of the original scores, mainly for illustrative purposes (Schoch 2015). Table 5 is analogous to Table 3, but shows the Pearson correlation of the logarithms of the proposed ego-closeness (with h=2) and global closeness scores. The results suggest that our proposed ego-centric metric not only has a high positive linear association (as inferred from Table 3) but also a very high positive non-linear association with the global closeness centrality.

Table 5 Pearson correlation coefficients between the logarithms of the proposed ego-centric score with small exploration radius (i.e. h={2,3}) and the global closeness score on synthetic and real-world networks

Running Time Comparison

In Table 6, we empirically compare the running times for computing the local metrics reviewed in the “Local Closeness Metrics” section (i.e. Dist-Exact, DACCER, Weight-Vol, and our proposed ego-centric measure) over the synthetic networks. The running times were measured in a simulated distributed environment on a 2.5 GHz Intel Core i7 Apple MacBook Pro laptop. We set the radius of each node's local neighborhood to h=2, as in the other experiments and for the same reasons.

Table 6 Running time (in milliseconds) comparison for different local metrics on synthetic networks in a simulated distributed environment

Note that in the distributed and decentralized setting considered here, each node in the network executes a process to compute its local metric based on its visible neighborhood radius, independently of the other nodes' processes. The distributed running time reported for a metric on a network equals the longest execution time among all nodes' processes for computing that metric. Table 6 shows that our proposed metric is the fastest local measure to compute in a decentralized manner on all of the synthetic networks.

Effect of Sparsity Level k on Accuracy

Figure 3 shows the effect of the sparsity level k on the accuracy of CS-HICLOSE compared with the CS-based competing methods, with the number of measurements set to 0.4|V| and the measurement length set to 0.25|V|. The measurement length in DICeNod is determined by another parameter \(d = \frac {\varepsilon }{C k}m\), where \(\varepsilon \in (0, \frac {1}{6})\) and C>1. For a fair comparison, we chose ε and C such that the average measurement length in this method matches that of the other methods. The higher the F-measure, the stronger the correlation between the top-k nodes identified by a method and the global closeness centrality.

Fig. 3

Effect of sparsity level k on the accuracy of CS-HICLOSE and the competing methods for the number of correctly detected top-k closeness centrality nodes. For all methods, we set the number of measurements to 0.4|V| and the measurements length to 0.25|V|. The higher the value of F-measure is, the more correlation between the top-k nodes list identified by a method and the global closeness centrality will be

Effect of Number of Measurements m on Accuracy

The accuracy of CS-HICLOSE is compared to the existing CS-based methods in terms of F-measure for a varying number of measurements, with the measurement length (l) set to 0.25|V| and the sparsity (k) set to 0.15|V| in a network with |V| nodes. For DICeNod, l is determined by m and k.

Figure 4 clearly shows that CS-HICLOSE outperforms the competing methods, achieving a higher F-measure for almost every number of measurements. Moreover, our method is more accurate even with a small number of measurements. This improvement can be very important in situations where performing measurements has a high computational cost (Mahyar et al. 2015a; Mahyar et al. 2013b).

Fig. 4

Effect of the required number of measurements m on the accuracy of CS-HICLOSE in terms of F-measure, compared to RW, TopCent, and DICeNod. For each method, we set the measurements length to 0.25|V| and the sparsity to 0.15|V| in a network with |V| nodes

Effect of Measurement Length l on Accuracy

Figure 5 illustrates that CS-HICLOSE achieves a higher F-measure for most measurement lengths in all test cases, compared with the CS-based methods RW, TopCent, and DICeNod. Since the concept of measurement length is again irrelevant to the other competing methods, we compare our accuracy only with the CS-based approaches. The horizontal axis in Fig. 5 shows the measurement length l divided by the total number of network nodes |V|. This experiment is performed on networks with |V| nodes, with the number of measurements set to m=0.4|V| and the sparsity level set to k=0.2|V| for all methods. We repeated each test 10 times to reduce the effect of the methods' randomness; the points in the figures show the mean over these repetitions. In Fig. 5, one can observe an increasing trend in the F-measure of CS-HICLOSE as the measurement length increases.

Fig. 5

Effect of measurement length l on the accuracy of CS-HICLOSE in terms of F-measure, compared to RW, TopCent, and DICeNod. For each method, we set the number of measurements to 0.4|V| and the sparsity level to 0.2|V| in a network with |V| nodes

Conclusion

Closeness centrality has been utilized as a primary metric for measuring the relative importance/influence of nodes in a given network. In this paper, we introduced a new ego-centric metric that has a low computational cost and correlates well with the global closeness centrality. We then proposed a compressive sensing framework for the distributed detection of the top-k central nodes based on the ego-closeness metric, using only indirect measurements. Extensive experimental evaluations on both synthetic and real networks demonstrated that the proposed method outperforms the best existing methods in efficiently detecting high closeness centrality nodes, achieving a high F-measure with low complexity. The experimental results also indicated that our ego-centric metric shows a lower correlation with the global closeness centrality on networks whose average degree is low relative to their size; generalizing our ego-centric metric to address this limitation is of interest for future work.