RCELF: A residual-based approach for Influence Maximization Problem
Introduction
The Influence Maximization Problem (IMP) has been widely studied in the literature [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. Mathematically, given a network, a positive integer $k$, and a diffusion model, IMP returns a subset of $k$ nodes (the seed set) that has the maximum spread in the network. In particular, the diffusion model defines exactly how influence propagates in a network. Independent Cascade (IC), Weighted Cascade (WC) and Linear Threshold (LT) are three widely accepted models [26], [27]. The spread is the expected number of nodes influenced by the seed set under the given diffusion model. A primary application of IMP is viral marketing [2], [26], [28], [29], [30], [31], [32], [33], [34] in social networks (e.g., Facebook, Twitter, Weibo), which is widely used to promote new products or activities. For example, new products are advertised by influencers in social networks to other users through the “word-of-mouth” effect [26]. Besides viral marketing, IMP can also be applied to epidemic detection [2], [35], [36], [37], rumor control [28], [29] and water network monitoring [2], [38], [39], [40].
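The problem statement above can be written compactly as an optimization problem. The notation below is an assumption made for illustration, since the original symbols are not recoverable from this text: $G = (V, E)$ is the network, $k$ the seed budget, $M$ the diffusion model, and $\sigma_{M}(S)$ the expected spread of seed set $S$ under $M$.

```latex
S^{*} \;=\; \operatorname*{arg\,max}_{S \subseteq V,\; |S| = k} \; \sigma_{M}(S),
\qquad
\sigma_{M}(S) \;=\; \mathbb{E}\bigl[\,\text{number of nodes activated from } S \text{ under } M\,\bigr].
```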
In general, diffusion models define how a node switches its status from inactive to active on a weighted graph, where the weight of each edge is the influence probability. For example, in the IC model an active node $u$ has a single chance to influence each inactive neighbor $v$ with probability $p(u, v)$. In the literature, a commonly used influence probability assignment method in both IC (i.e., WC) and LT models is setting $p(u, v) = 1/d_{in}(v)$ [41], where $d_{in}(v)$ is the in-degree of $v$. This assignment method assumes that a user is likely to be activated if all of her incoming neighbors are active in both WC and LT models. However, this may not be practical in some real-world applications. For example, many Twitter users (e.g., those who use Twitter less than once per day) are probably not influenced even if all of their neighbors are active. Interestingly, there is also another kind of social network in which users can be influenced even if only one or a few of their neighbors are active; e.g., users of Pinduoduo (an online shopping website in China) are easily influenced because they can form a shopping team to get a lower price for their purchase. To overcome the above limitations of the commonly used probability assignment method, i.e., $p(u, v) = 1/d_{in}(v)$, we propose a generalized probability assignment method in this work. Specifically, the probability that node $u$ activates node $v$ along edge $(u, v)$ is controlled by a tunable parameter that reflects the activeness of the users. For a particular setting of this parameter, the generalized IC and LT models reduce to the conventional IC and LT models, respectively.
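As a minimal sketch of the conventional setting described above, the following Python code assigns Weighted Cascade probabilities (each edge probability is the reciprocal of the target node's in-degree) and simulates Independent Cascade diffusion. The function names, the toy graph, and the number of runs are illustrative assumptions, not taken from the paper, and the generalized assignment proposed in this work is not modeled here.

```python
import random
from collections import defaultdict


def weighted_cascade_probs(edges):
    """Assign p(u, v) = 1 / in_degree(v) to every directed edge (u, v)."""
    in_degree = defaultdict(int)
    for _, v in edges:
        in_degree[v] += 1
    return {(u, v): 1.0 / in_degree[v] for u, v in edges}


def simulate_ic(edges, probs, seeds, rng):
    """Run one Independent Cascade diffusion and return the set of active nodes."""
    out_neighbors = defaultdict(list)
    for u, v in edges:
        out_neighbors[u].append(v)
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_neighbors[u]:
                # A newly active node gets a single chance per inactive neighbor.
                if v not in active and rng.random() < probs[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(edges, probs, seeds, runs=1000, seed=0):
    """Monte-Carlo estimate of the expected spread (mean cascade size)."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(edges, probs, seeds, rng)) for _ in range(runs))
    return total / runs


# Hypothetical toy graph: a -> b -> c is a chain; d has two in-neighbors (a and c).
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")]
probs = weighted_cascade_probs(edges)
spread = estimate_spread(edges, probs, seeds=["a"])
```

In this toy graph, b and c have in-degree 1 and are therefore always activated from a, while d has in-degree 2 and receives two independent attempts with probability 0.5 each.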
It is NP-hard to find the optimal size-$k$ seed set for IMP under the Independent Cascade (IC) and Linear Threshold (LT) diffusion models [1]. Due to this hardness, many approximation approaches [2], [3], [4], [7], [8], [10], [11], [12], [13], [14] and heuristic solutions [9], [15], [16], [17], [18], [19], [20], [21], [22], [24], [43] have been proposed and extensively studied. However, all existing approaches (see Fig. 1) trade off among time efficiency, memory consumption, and result quality [42]. Since the empirical performance of the state-of-the-art approximation approaches (e.g., IMM [11], DSSA [13]) is comparable to, or even better than, that of the state-of-the-art heuristic solutions (e.g., IRIE [20], IMRank [43]), we focus on approximation solutions for IMP in this work.
Existing approximation approaches for IMP: We classify existing approximation approaches into three categories: Monte-Carlo Simulation, Snapshots, and Reverse Influence Sampling. We summarize the representative approaches of each category in Fig. 2. For IMP, it is vital to overcome the #P-hardness of evaluating the spread of a given seed set. Prior approximation works propose different strategies to estimate the expected spread. In particular, Monte-Carlo Simulation-based approaches (e.g., GREEDY [1], CELF (Cost-Effective Lazy Forward) [2] and CELF++ [3]) estimate the spread by repeatedly simulating the diffusion process, which is not scalable because Monte-Carlo simulations are expensive. Snapshots-based approaches (e.g., SG [4] and PMC [7]) sample subgraphs (a.k.a. snapshots) of the social network in advance by retaining each edge with a probability equal to its weight. The spread of a seed set is then estimated by averaging its influence over all snapshots. However, the memory overhead of storing the snapshots is prohibitively large. Reverse Influence Sampling-based approaches (e.g., TIM [10], IMM [11] and DSSA [13]) sample a sufficient number of reverse reachable sets for the nodes. However, the number and size of the reverse reachable sets can be very large (see Section 2.2).
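To make the Reverse Influence Sampling idea concrete, here is a minimal sketch under the IC model. It samples reverse reachable (RR) sets by walking edges backwards from a random target and estimates the spread of a seed set as the number of nodes times the fraction of RR sets that the seed set intersects. All names and the toy graph are assumptions for illustration. The sketch only counts hits, whereas approaches such as TIM and IMM keep the sampled sets for seed selection, which is where the memory cost mentioned above arises.

```python
import random
from collections import defaultdict


def sample_rr_set(nodes, in_neighbors, probs, rng):
    """Sample one reverse reachable (RR) set under the IC model.

    Pick a uniformly random target node, then traverse edges backwards,
    keeping each incoming edge (u, v) independently with probability
    probs[(u, v)].
    """
    target = rng.choice(nodes)
    rr_set = {target}
    frontier = [target]
    while frontier:
        v = frontier.pop()
        for u in in_neighbors[v]:
            if u not in rr_set and rng.random() < probs[(u, v)]:
                rr_set.add(u)
                frontier.append(u)
    return rr_set


def estimate_spread_rr(nodes, edges, probs, seeds, num_sets=2000, seed=0):
    """Estimate spread as n * (fraction of RR sets that contain a seed)."""
    in_neighbors = defaultdict(list)
    for u, v in edges:
        in_neighbors[v].append(u)
    rng = random.Random(seed)
    seed_set = set(seeds)
    hits = sum(
        1
        for _ in range(num_sets)
        if seed_set & sample_rr_set(nodes, in_neighbors, probs, rng)
    )
    return len(nodes) * hits / num_sets


# Hypothetical toy chain a -> b -> c in which every edge always fires.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
probs = {edge: 1.0 for edge in edges}
spread_a = estimate_spread_rr(nodes, edges, probs, ["a"])
spread_c = estimate_spread_rr(nodes, edges, probs, ["c"])
```

In the toy chain every RR set contains a, so the estimate for seed a equals the number of nodes, while seed c is hit only when c itself is the sampled target.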
Our approach: In this paper, we propose a novel residual-based algorithm, RCELF, to overcome the deficiencies of existing approximation approaches. The core of RCELF is a novel marginal gain computation method based on probability theory. Specifically, we define the residual capacity of a node as the maximal contribution that the node can make to the influence spread of the seed set. Initially, each node has its full residual capacity. During each seed node selection step, the residual capacity of a node diminishes either because the node is selected as a seed or because it is influenced by other selected seed nodes. RCELF achieves excellent time efficiency because (i) it enjoys the benefits of the submodularity of the residual-based influence function and of the cost-effective lazy forward node selection strategy, while requiring far fewer Monte-Carlo simulations; (ii) the number of nodes under consideration (i.e., nodes whose residual capacity is larger than 0) falls quickly during the seed set selection process; and (iii) two optimizations are devised to speed up RCELF. Meanwhile, RCELF guarantees an approximation ratio for its result, as elaborated in Section 3. From the memory consumption perspective, RCELF only stores the raw social network data. Unlike snapshot-based and reverse influence sampling-based approaches, it incurs no extra memory consumption. Thus, the space complexity of RCELF is optimal, i.e., linear in the size of the input network. Moreover, RCELF is robust to the generalized probability assignment method in both WC and LT models, whose tunable parameter reflects the influence degree of each user in the social network. In summary, RCELF achieves excellent time efficiency, low memory consumption and approximation-guaranteed result quality for IMP in widely used diffusion models (i.e., as shown in the center of Fig. 1).
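For readers unfamiliar with the cost-effective lazy forward selection that RCELF builds on, the following is a minimal sketch of the classic CELF-style lazy greedy loop [2]. It is not the RCELF algorithm: the residual-capacity gain computation is not specified in this excerpt, so the sketch takes an arbitrary `spread` function as an assumption, and the toy coverage oracle and all names are hypothetical. The lazy re-evaluation is only valid when `spread` is monotone and submodular.

```python
import heapq


def celf(nodes, spread, k):
    """Select up to k seeds with lazy-forward (CELF-style) greedy selection.

    `spread` maps a list of seeds to an (estimated) spread value.
    """
    seeds, current = [], 0.0
    # Max-heap via negated gains. Each entry stores the round in which its
    # marginal gain was last computed; the index i breaks ties.
    heap = [(-spread([v]), i, v, 0) for i, v in enumerate(nodes)]
    heapq.heapify(heap)
    while heap and len(seeds) < k:
        neg_gain, i, v, last_round = heapq.heappop(heap)
        if last_round == len(seeds):
            # The gain is up to date, so v is the best candidate this round.
            seeds.append(v)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current seeds and reinsert.
            gain = spread(seeds + [v]) - current
            heapq.heappush(heap, (-gain, i, v, len(seeds)))
    return seeds, current


# Hypothetical deterministic oracle: each node "covers" a fixed set of users.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def cover(seed_list):
    """Number of distinct users covered by the given seeds."""
    return float(len(set().union(*(coverage[s] for s in seed_list))))


seeds, value = celf(list(coverage), cover, k=2)
```

By submodularity, a stale gain is an upper bound on the true marginal gain, so a candidate whose refreshed gain still tops the heap can be selected without re-evaluating the remaining nodes. This is the source of the savings in spread evaluations.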
We summarize the comparison between our proposal RCELF and existing approximation approaches for IMP in Table 1. Specifically, compared with the state-of-the-art methods, RCELF provides an approximation guarantee with the least memory consumption. Although RCELF has the same time complexity as Snapshot-based methods, it is empirically more efficient (see Section 4). The contributions of this paper are summarized as follows.
- Based on our proposed generalized influence probability assignment method, comprehensive experiments are conducted to reveal the trade-off strategies among the state-of-the-art approximate approaches for IMP (Section 2).
- We propose a residual-based algorithm RCELF for IMP to achieve excellent time efficiency, low memory overhead, and approximation guaranteed results concurrently (Section 3).
- We evaluate the effectiveness and efficiency of our proposal by extensive experiments on real-world benchmark datasets (Section 4).
The remainder of this paper is organized as follows. Section 2 describes the preliminaries and related work on IMP and conducts comprehensive experiments on the state-of-the-art approximate approaches to reveal their underlying issues. Section 3 presents our residual-based approach RCELF for IMP. Section 4 verifies the superiority of our proposal through extensive experiments, followed by the conclusion in Section 5.
Influence Maximization Problem
In this section, we first define the influence maximization problem (IMP) formally. Then, we conduct extensive preliminary experiments on the representative approximate approaches, and present the findings of existing approaches.
Residual-based approach
Existing approximate approaches for IMP compromise either time efficiency or memory overhead for result quality. In this section, we propose a novel residual-based approach (i.e., RCELF) for IMP to overcome this dilemma. We present the intuition and the fundamental concepts of RCELF in Section 3.1 and Section 3.2, respectively. In Section 3.3, we describe the backbone of RCELF and devise two performance optimization techniques for it. We conduct correctness, complexity and approximate
Experimental evaluation
In this section, we evaluate RCELF and present our empirical findings. In Section 4.1, we describe the experimental setting. In Section 4.2, we compare our proposal with existing approximate approaches on real datasets and investigate the effectiveness of the optimization technique.
Conclusion
In this paper, we discuss existing approximation solutions for IMP, which compromise time efficiency or memory consumption for approximate result quality. To address this, we propose a residual-based algorithm RCELF for IMP, which concurrently achieves good time efficiency, low memory consumption and approximation-guaranteed result quality in generalized WC and LT models. Besides, we propose several optimizations to accelerate RCELF. We demonstrate the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Science Foundation of China (NSFC No. 61802163), the Education Department of Guangdong (Grant No. 2020KZDZX1184, 2020ZDZX3043), and the Guangdong Provincial Key Laboratory (Grant No. 2020B121201001).
References (62)
- The budgeted maximum coverage problem. Inform. Process. Lett. (1999)
- C2IM: Community based context-aware influence maximization in social networks. Physica A (2019)
- MIM2: Multiple influence maximization across multiple social networks. Physica A (2019)
- ComBIM: A community-based solution approach for the budgeted influence maximization problem. Expert Syst. Appl. (2019)
- Maximizing the spread of influence through a social network
- Cost-effective outbreak detection in networks
- CELF++: optimizing the greedy algorithm for influence maximization in social networks
- StaticGreedy: solving the scalability-accuracy dilemma in influence maximization
- Efficient algorithms for adaptive influence maximization. Proc. VLDB Endow. (2018)
- Online processing algorithms for influence maximization
- Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations
- Holistic influence maximization: Combining scalability and efficiency with opinion-aware models
- Influence maximization: Near-optimal time complexity meets practical efficiency
- Influence maximization in near-linear time: A martingale approach
- On the upper bounds of spread for greedy algorithms in social network influence maximization. IEEE Trans. Knowl. Data Eng.
- Stop-and-stare: Optimal sampling algorithms for viral marketing in billion-scale networks
- Revisiting the stop-and-stare algorithms for influence maximization. Proc. VLDB Endow.
- Efficient influence maximization in social networks
- Scalable influence maximization for prevalent viral marketing in large-scale social networks
- Scalable influence maximization in social networks under the linear threshold model
- SIMPATH: An efficient algorithm for influence maximization under the linear threshold model
- IRIE: Scalable and robust influence maximization in social networks
- Scalable and parallelizable processing of influence maximization for large-scale social networks?
- Influence maximization over large-scale social networks: A bounded linear approach
- Sketch-based influence maximization and computation: Scaling up with guarantees
- Importance sketching of influence dynamics in billion-scale networks
- Talk of the network: A complex systems look at the underlying process of word-of-mouth. Mark. Lett.
- Threshold models of collective behavior. Am. J. Sociol.