Elsevier

Information Systems

Volume 102, December 2021, 101828

RCELF: A residual-based approach for Influence Maximization Problem

https://doi.org/10.1016/j.is.2021.101828

Highlights

  • We reveal the trade-off strategies among the approximate approaches for IMP.

  • An effective and efficient approximate algorithm called RCELF is proposed.

  • The performance of RCELF is extensively evaluated on 5 real-world datasets.

Abstract

The Influence Maximization Problem (IMP) is to select a seed set of nodes in a social network so that influence spreads as widely as possible. It has applications in multiple domains; e.g., viral marketing is frequently used to advertise new products or activities. Although it is a classic and well-studied problem in computer science, all proposed techniques compromise among time efficiency, memory consumption, and result quality. In this paper, we conduct comprehensive experimental studies on the state-of-the-art approximate approaches for IMP to reveal the underlying trade-off strategies. Interestingly, we find that even the state-of-the-art approaches become impractical when the propagation probability of the network is taken into consideration. Building on these findings, we propose a novel residual-based approach (i.e., RCELF) for IMP, which (i) overcomes the deficiencies of existing approximate approaches, and (ii) provides theoretically guaranteed results with high efficiency in both time and space. We demonstrate the superiority of our proposal by extensive experimental evaluation on real datasets.

Introduction

The Influence Maximization Problem (IMP) has been widely studied in the literature [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. Mathematically, given a network G, a positive integer k and a diffusion model M, IMP returns a size-k subset S of the nodes of G that has the maximum spread σ(S) in G. In particular, the diffusion model M defines exactly how influence diffuses in a network. Independent Cascade (IC), Weighted Cascade (WC) and Linear Threshold (LT) are three widely accepted models [26], [27]. The spread σ(S) is the expected number of nodes influenced through the seed set S under the given diffusion model M. A primary application of IMP is viral marketing [2], [26], [28], [29], [30], [31], [32], [33], [34] in social networks (e.g., Facebook, Twitter, Weibo), which is widely applied for promoting new products or activities. For example, new products are advertised by some influencers in social networks to other users through the “word-of-mouth” effect [26]. Besides viral marketing, IMP can also be applied in epidemic detection [2], [35], [36], [37], rumor control [28], [29] and water network monitoring [2], [38], [39], [40].
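
The problem statement above can be written compactly as follows. This is only a notational sketch: the symbols V (the node set of G) and I_M(S) (the random set of nodes that end up active when diffusion starts from S under model M) are introduced here for illustration and are not taken from the paper.

```latex
S^{*} \;=\; \operatorname*{arg\,max}_{S \subseteq V,\; |S| = k} \sigma(S),
\qquad
\sigma(S) \;=\; \mathbb{E}\bigl[\, |I_{M}(S)| \,\bigr]
```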

In general, a diffusion model defines how a node switches its status from inactive to active on a weighted graph, where the weight of each edge is the influence probability. For example, in the IC model an active node u has a single chance to influence its inactive neighbor v, succeeding with probability w(u,v). In the literature, a commonly used influence probability assignment method in both the IC (i.e., WC) and LT models is to set w(u,v) = 1/|In(v)| [41], where |In(v)| is the in-degree of v. This assignment assumes that a user is likely to be activated if all of her incoming neighbors are active, in both the WC and LT models. However, this may not be practical in some real-world applications. For example, many Twitter users (e.g., those who use Twitter less than once per day) are probably not influenced even if all their neighbors are active. Interestingly, there are also social networks in which users can be influenced even if only one or a few of their neighbors are active; e.g., users of Pinduoduo (an online shopping website in China) are easily influenced because they can form a shopping team to get a lower price for their purchase. To overcome these limitations of the commonly used probability assignment method, i.e., 1/|In(v)|, we propose a generalized probability assignment method in this work. Specifically, the probability that node u activates node v along edge (u,v) is w(u,v) = ρ/|In(v)|, where ρ reflects the activeness of the users. When ρ = 1, the generalized IC and LT models reduce to the conventional IC and LT models, respectively.
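
As a minimal sketch of the generalized assignment w(u,v) = ρ/|In(v)|, the following Python snippet computes edge weights from a directed edge list. The function name `assign_weights`, the edge-list representation, and the cap at 1.0 are assumptions made for illustration; the paper does not prescribe them.

```python
from collections import defaultdict


def assign_weights(edges, rho=1.0):
    """Return {(u, v): rho / |In(v)|} for a directed edge list.

    In(v) is taken to be the set of in-neighbours of v, so |In(v)| is
    the in-degree of v. All names here are illustrative.
    """
    in_neighbours = defaultdict(set)
    for u, v in edges:
        in_neighbours[v].add(u)
    # Cap at 1.0 so every weight stays a valid probability when
    # rho > |In(v)| (the cap is an assumption of this sketch).
    return {(u, v): min(1.0, rho / len(in_neighbours[v])) for u, v in edges}


edges = [("a", "c"), ("b", "c"), ("c", "d")]
weights = assign_weights(edges, rho=0.5)
```

With ρ = 1 the same function yields the conventional 1/|In(v)| weights, so a single code path covers both the conventional and the generalized settings.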

It is NP-hard to find the optimal size-k set for IMP under the Independent Cascade (IC) and Linear Threshold (LT) diffusion models [1]. Due to the hardness of IMP, many approximation approaches [2], [3], [4], [7], [8], [10], [11], [12], [13], [14] and heuristic solutions [9], [15], [16], [17], [18], [19], [20], [21], [22], [24], [43] have been proposed and extensively studied. However, all existing approaches (see Fig. 1) trade off among time efficiency, memory consumption, and result quality [42]. Since the empirical performance of the state-of-the-art approximation approaches (e.g., IMM [11], DSSA [13]) is comparable to, or even better than, that of the state-of-the-art heuristic solutions (e.g., IRIE [20], IMRank [43]), we focus on approximate solutions for IMP in this work.

Existing approximation approaches for IMP: We classify existing approximation approaches into three categories: Monte-Carlo Simulation, Snapshots, and Reverse Influence Sampling. We summarize the representative approaches of each category in Fig. 2. For IMP, it is vital to overcome the #P-hardness of evaluating the spread σ(S) of a given seed set S. Prior approximation works propose different strategies to estimate the expected spread. In particular, Monte-Carlo Simulation-based approaches (e.g., GREEDY [1], CELF (Cost-Effective Lazy Forward) [2] and CELF++ [3]) estimate σ(·) by repeatedly simulating the diffusion process, which is not scalable because Monte-Carlo simulations are expensive. Snapshot-based approaches (e.g., SG [4] and PMC [7]) sample subgraphs G_i (a.k.a. snapshots) of the social network G in advance by retaining each edge with probability equal to its weight. The spread on G is estimated by averaging the influence over all snapshots. However, the memory overhead of storing the snapshots is prohibitive. Reverse Influence Sampling-based approaches (e.g., TIM [10], IMM [11] and DSSA [13]) sample a sufficient number of reverse reachable sets for the nodes. However, the reverse reachable sets can be both numerous and large (see Section 2.2).
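
To make the Monte-Carlo strategy concrete, here is a minimal sketch of estimating σ(S) under the IC model by repeated simulation. The adjacency-list representation, the function names `simulate_ic` and `estimate_spread`, and the default of 1000 rounds are illustrative assumptions, not details taken from the cited approaches.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade diffusion and return the activated set.

    `graph` maps each node u to a list of (v, w_uv) pairs, where w_uv is
    the probability that an active u activates v.
    """
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, w in graph.get(u, []):
                # Each newly activated u gets a single chance to activate v.
                if v not in active and rng.random() < w:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, rounds=1000, seed=0):
    """Average the number of activated nodes over `rounds` simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(rounds))
    return total / rounds


graph = {"a": [("b", 0.5), ("c", 0.5)], "b": [("d", 0.5)], "c": [("d", 0.5)]}
print(estimate_spread(graph, ["a"]))
```

Every call to `estimate_spread` traverses the graph `rounds` times, and a greedy selection calls it for many candidate seed sets, which is where the scalability problem noted above comes from.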

Our approach: In this paper, we propose a novel residual-based algorithm, RCELF, to overcome the deficiencies of existing approximation approaches. The core of RCELF is a novel marginal gain computation method based on probability theory. Specifically, we define the residual capacity of a node as the maximal contribution that the node can make to the influence spread of the seed set. Initially, the residual capacity of each node is 1. During each seed node selection step, the residual capacity of a node diminishes either because it is selected as a seed node or because it is influenced by another selected seed node. RCELF achieves excellent time efficiency because (i) it enjoys the benefits of the submodularity of the residual-based influence function and of the cost-effective lazy forward node selection manner, while requiring far fewer Monte-Carlo simulations; (ii) the number of nodes under consideration (i.e., those whose residual capacities are larger than 0) falls quickly during the seed set selection process; and (iii) two optimizations are devised to speed up RCELF. Meanwhile, RCELF guarantees a (1 − 1/e − ε)-approximation of the result, as elaborated in Section 3. From the memory consumption perspective, RCELF only stores the raw social network data. Unlike snapshot-based and reverse influence sampling-based approaches, it incurs no extra memory consumption. Thus, the space complexity of RCELF is optimal, i.e., O(n+m). Moreover, RCELF is robust to the generalized probability assignment method, i.e., ρ/|In(v)|, in both the WC and LT models, where ρ is a tunable parameter that reflects the influence degree of each user in the social network. In summary, RCELF achieves excellent time efficiency, low memory consumption and approximation-guaranteed result quality for IMP in widely used diffusion models (i.e., as shown in the center of Fig. 1).
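
For readers unfamiliar with the cost-effective lazy forward selection that RCELF builds on, the following is a minimal sketch of standard CELF-style lazy greedy selection for an arbitrary monotone submodular function. It is not the RCELF algorithm: the residual-capacity computation is not reproduced here, and the function name `celf`, its signature, and the toy coverage function are assumptions for illustration only.

```python
import heapq


def celf(nodes, k, spread):
    """Pick k seeds greedily with lazy (CELF-style) gain re-evaluation.

    `spread(S)` is any monotone submodular set function supplied by the
    caller, e.g. a spread estimator.
    """
    seeds, current = [], spread([])
    # Max-heap via negated gains. Each entry stores the seed-set size at
    # which its gain was last computed; the index i breaks ties.
    heap = [(-(spread([v]) - current), i, v, 0) for i, v in enumerate(nodes)]
    heapq.heapify(heap)
    while heap and len(seeds) < k:
        neg_gain, i, v, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # The gain is up to date; by submodularity no stale entry
            # can beat it, so v is the greedy choice.
            seeds.append(v)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current seed set.
            gain = spread(seeds + [v]) - current
            heapq.heappush(heap, (-gain, i, v, len(seeds)))
    return seeds


# Toy coverage function standing in for a spread estimator.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}


def coverage(S):
    return len(set().union(*(cover[v] for v in S)))


print(celf(list(cover), 2, coverage))  # prints ['a', 'c']
```

The lazy re-evaluation is what saves work: in the toy run above, node d is never re-evaluated after the first pass, because its stale gain already rules it out.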

We summarize the comparison between our proposal RCELF and existing approximate approaches for IMP in Table 1. Specifically, compared with the state-of-the-art methods, RCELF achieves the (1 − 1/e − ε) approximation ratio with the least memory consumption. Although RCELF has the same time complexity as snapshot-based methods, it is empirically more efficient (see Section 4). The contributions of this paper are summarized as follows.

  • Based on our proposed generalized influence probability assignment method, comprehensive experiments are conducted to reveal the trade-off strategies among the state-of-the-art approximate approaches for IMP (Section 2).

  • We propose a residual-based algorithm RCELF for IMP to achieve excellent time efficiency, low memory overhead, and approximation guaranteed results concurrently (Section 3).

  • We evaluate the effectiveness and efficiency of our proposal by extensive experiments on real-world benchmark datasets (Section 4).

The remainder of this paper is organized as follows. Section 2 describes the preliminaries and related works of IMP and conducts comprehensive experiments on the state-of-the-art approximate approaches to reveal their underlying issues. Section 3 presents our residual-based approach RCELF for IMP. Section 4 verifies the superiority of our proposal by extensive experiments, followed by the conclusion in Section 5.

Section snippets

Influence Maximization Problem

In this section, we first define the influence maximization problem (IMP) formally. Then, we conduct extensive preliminary experiments on the representative approximate approaches, and present the findings of existing approaches.

Residual-based approach

Existing approximate approaches for IMP are compromising either time efficiency or memory overhead for result quality. In this section, we propose a novel residual-based approach (i.e., RCELF) for IMP to overcome this dilemma. We present the intuition and fundamental concepts of RCELF approach in Section 3.1 and Section 3.2 respectively. In Section 3.3, we describe the backbone of RCELF and devise two performance optimization techniques for it. We conduct correctness, complexity and approximate

Experimental evaluation

In this section we evaluate RCELF and present our empirical findings. In Section 4.1 we describe the experimental setting. In Section 4.2 we compare our proposal with existing approximate approaches in real datasets, and investigate the effectiveness of the optimization technique.

Conclusion

In this paper, we discuss existing approximation solutions for IMP, which are compromising time efficiency or memory consumption for the approximate result quality. In order to address that, we propose a residual-based algorithm RCELF for IMP, which achieves good time efficiency, low memory consumption and approximate guaranteed result quality concurrently in generalized WC and LT models. Besides, we propose several optimizations to accelerate the performance of RCELF. We demonstrate the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Science Foundation of China (NSFC No. 61802163), the Education Department of Guangdong (Grant No. 2020KZDZX1184, 2020ZDZX3043), and the Guangdong Provincial Key Laboratory (Grant No. 2020B121201001).

References (62)

  • Ohsaka N. et al., Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations

  • C. Borgs, M. Brautbar, J. Chayes, B. Lucier, Maximizing social influence in nearly optimal time, in: SODA, 2014, pp....

  • Galhotra S. et al., Holistic influence maximization: Combining scalability and efficiency with opinion-aware models

  • Tang Y. et al., Influence maximization: Near-optimal time complexity meets practical efficiency

  • Tang Y. et al., Influence maximization in near-linear time: A martingale approach

  • Zhou C. et al., On the upper bounds of spread for greedy algorithms in social network influence maximization, IEEE Trans. Knowl. Data Eng. (2015)

  • Nguyen H.T. et al., Stop-and-stare: Optimal sampling algorithms for viral marketing in billion-scale networks

  • Huang K. et al., Revisiting the stop-and-stare algorithms for influence maximization, Proc. VLDB Endow. (2017)

  • M. Kimura, K. Saito, Tractable models for information diffusion in social networks, in: PKDD, 2006, pp....

  • Chen W. et al., Efficient influence maximization in social networks

  • Chen W. et al., Scalable influence maximization for prevalent viral marketing in large-scale social networks

  • Chen W. et al., Scalable influence maximization in social networks under the linear threshold model

  • Goyal A. et al., Simpath: An efficient algorithm for influence maximization under the linear threshold model

  • Jung K. et al., IRIE: Scalable and robust influence maximization in social networks

  • Kim J. et al., Scalable and parallelizable processing of influence maximization for large-scale social networks?

  • Liu Q. et al., Influence maximization over large-scale social networks: A bounded linear approach

  • Cohen E. et al., Sketch-based influence maximization and computation: Scaling up with guarantees

  • J. Tang, X. Tang, J. Yuan, Influence maximization meets efficiency and effectiveness: A hop-based approach, in: ASONAM,...

  • Nguyen H.T. et al., Importance sketching of influence dynamics in billion-scale networks

  • Goldenberg J. et al., Talk of the network: A complex systems look at the underlying process of word-of-mouth, Mark. Lett. (2001)

  • Granovetter M., Threshold models of collective behavior, Am. J. Sociol. (1978)