CrawlSN: community-aware data acquisition with maximum willingness in online social networks

Data Mining and Knowledge Discovery

Abstract

Real social network datasets with community structures are critical for evaluating algorithms in Online Social Networks (OSNs). However, obtaining such community data from OSNs has become increasingly challenging due to privacy issues and government regulations. In this paper, we make a first attempt to address two important factors, i.e., user willingness and the existence of community structure, to obtain more complete OSN data. We formulate a new research problem, namely Community-aware Data Acquisition with Maximum Willingness in Online Social Networks (CrawlSN), to identify a group of users from an OSN such that the group is a socially tight community and the users’ willingness to contribute data is maximized. We prove that CrawlSN is NP-hard and inapproximable within any factor unless \(P = NP\), and propose an effective algorithm, named Community-aware Group Identification with Maximum Willingness (CIW), with various processing strategies. We conduct an evaluation study with 1093 volunteers to validate our problem formulation and demonstrate that CrawlSN outperforms the alternatives. We also perform extensive experiments on 7 real datasets and show that the proposed CIW outperforms the baselines in both solution quality and efficiency.

Notes

  1. Many other factors, such as the relationship types of users, are also important. Here, we discuss the two fundamental factors for crawling community data for further analysis, and leave the other important factors to future work.

  2. We show undirected edges here for the clarity of presentation. Directed relations can be easily incorporated in our problem formulation.

  3. We have implemented a light crawler using Python 3.6, which is able to obtain the publicly accessible user data in OSNs.

  4. Influence strengths can be inferred by employing existing approaches (Kempe et al. 2003; Gomez-Rodriguez et al. 2012).

  5. We have built a simple machine learning model with SVM that predicts users’ willingness from their publicly accessible information on Facebook.

  6. We can also consider directed influences in our problem formulation with a slight modification of the algorithm.

  7. If \(\tau _{v}=0\), we define the value of the second term of Eq. 1 as 0. That is, if \(\tau _{v}=0\), \(\frac{\sum _{u \in N_S(v)} \delta _{u,\emptyset } \cdot w_{u,v}}{\tau _v}=0\).

  8. In some extreme cases, for a user who is very unwilling to provide her data (i.e., with a small individual willingness), the influenced willingness may raise the value of Eq. 1 up to 1. To tackle this issue, an additional parameter \(\beta \in [0,1]\) can be applied to the second term (i.e., the influenced willingness) of Eq. 1. By setting a smaller \(\beta \), i.e., close to 0, the user’s individual willingness becomes more important in the computation of the average willingness.

  9. https://www.ibm.com/analytics/cplex-optimizer.

  10. http://www.gurobi.com.

  11. The source code is available online at http://www.cs.nthu.edu.tw/~chihya/CIW_download/.
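
The convention in note 7 can be illustrated with a minimal Python sketch. It evaluates only the second term of Eq. 1 as written in that note and returns 0 when \(\tau _v = 0\). The function name `influenced_willingness` and the dictionary-based inputs are assumptions made for illustration; they are not taken from the authors' released code, and the symbols \(\delta _{u,\emptyset }\), \(w_{u,v}\) and \(\tau _v\) are treated here as plain numbers supplied by the caller.

```python
from typing import Dict, Hashable, Iterable, Tuple


def influenced_willingness(
    v: Hashable,
    neighbors_in_group: Iterable[Hashable],
    delta: Dict[Hashable, float],
    w: Dict[Tuple[Hashable, Hashable], float],
    tau: Dict[Hashable, float],
) -> float:
    """Second term of Eq. 1 for user v (hypothetical helper).

    neighbors_in_group plays the role of N_S(v), delta[u] the role of
    delta_{u, emptyset}, w[(u, v)] the role of w_{u,v}, and tau[v] the
    role of tau_v.
    """
    # Convention from note 7: the term is defined as 0 when tau_v = 0,
    # which also avoids a division by zero.
    if tau[v] == 0:
        return 0.0
    total = sum(delta[u] * w[(u, v)] for u in neighbors_in_group)
    return total / tau[v]


# Toy inputs, chosen only to exercise both branches.
delta = {"a": 0.5, "b": 0.25}
w = {("a", "v"): 0.4, ("b", "v"): 0.8}
term = influenced_willingness("v", ["a", "b"], delta, w, {"v": 2.0})
term_zero_tau = influenced_willingness("v", ["a", "b"], delta, w, {"v": 0.0})
```

Checking \(\tau _v\) before summing keeps the zero case explicit rather than relying on exception handling, which mirrors how the note states the convention.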

References

  • Aksu H, Canim M, Chang Y, Korpeoglu I, Ulusoy O (2014) Distributed \(k\)-core view materialization and maintenance for large dynamic graphs. IEEE Trans Knowl Data Eng 26(10):2439–2452

  • Alvarez-Hamelin J, Dall’Asta L, Barrat A, Vespignani A (2005) K-core decomposition of internet graphs: hierarchies, self-similarity and measurement biases. Networks and Heterogeneous Media 3, Dec

  • Aridhi S, Brugnara M, Montresor A, Velegrakis Y (2016) Distributed k-core decomposition and maintenance in large dynamic graphs. In: Proceedings of the 10th ACM international conference on distributed and event-based systems, pp 161–168

  • Balasundaram B, Butenko S, Hicks IV (2011) Clique relaxations in social network analysis: the maximum k-plex problem. Oper Res 59(1):133–142

  • Blenn N, Doerr C, Van Kester B, Van Mieghem P (2012) Crawling and detecting community structure in online social networks using local information. In: Bestak R, Kencl L, Li LE, Widmer J, Yin H (eds) Networking 2012, pp 56–67

  • Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008

  • Bond RM, Fariss CJ, Jones JJ, Kramer AD, Marlow C, Settle JE, Fowler JH (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489(7415):295

  • Candogan O (2019) Persuasion in networks: public signals and k-cores. In: Proceedings of the 2019 ACM conference on economics and computation, EC ’19, pp 133–134. Association for Computing Machinery

  • Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197

  • Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 199–208

  • Chen S, Fan J, Li G, Feng J, Tan K-L, Tang J (2015) Online topic-aware influence maximization. Proc VLDB Endow 8(6):666–677

  • Cheng J, Ke Y, Fu AW-C, Yu JX, Zhu L (2010) Finding maximal cliques in massive networks by h*-graph. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 447–458

  • Cui W, Xiao Y, Wang H, Wang W (2014) Local search of communities in large graphs. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD ’14, pp 991–1002

  • Deutsch M, Gerard HB (1955) A study of normative and informational social influences upon individual judgment. J Abnormal Soc Psychol 51(3):629

  • Fang Y, Cheng R, Luo S, Hu J (2016) Effective community search for large attributed graphs. Proc VLDB Endow 9(12):1233–1244

  • Giatsidis C, Thilikos DM, Vazirgiannis M (2011) Evaluating cooperation in communities with the k-core structure. In: 2011 international conference on advances in social networks analysis and mining, pp 87–93

  • Gjoka M, Kurant M, Butts CT, Markopoulou A (2011) Practical recommendations on crawling online social networks. IEEE J Sel Areas Commun 29(9):1872–1892

  • Gomez-Rodriguez M, Leskovec J, Krause A (2012) Inferring networks of diffusion and influence. ACM Trans Knowl Discov from Data 5(4)

  • Goodfellow I, Bengio Y, Courville A (2016) Deep learning. The MIT Press, Cambridge

  • Goyal A, Bonchi F, Lakshmanan LV (2010) Learning influence probabilities in social networks. In: Proceedings of the third ACM international conference on web search and data mining, WSDM ’10, pp 241–250

  • Hsu B, Shen C, Yan X (2019a) Network intervention for mental disorders with minimum small dense subgroups. IEEE Trans Knowl Data Eng. 1–1

  • Hsu B-Y, Tu C-L, Chang M-Y, Shen C-Y (2019b) On crawling community-aware online social network data. In: Proceedings of the 30th ACM conference on hypertext and social media, pp 265–266

  • Huang X, Cheng H, Qin L, Tian W, Yu JX (2014) Querying k-truss community in large and dynamic graphs. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, pp 1311–1322

  • Huang X, Lakshmanan LV, Yu JX, Cheng H (2015) Approximate closest community search in networks. Proc VLDB Endow 9(4):276–287

  • Hung H-J, Lee W-C, Yang D-N, Shen C-Y, Lei Z, Chow S-M (2020) Efficient algorithms towards network intervention. In: Proceedings of the web conference 2020

  • Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03, pp 137–146

  • Kubat M (2015) An introduction to machine learning, 1st edn. Springer, Berlin

  • Laishram R, Wendt J, Soundarajan S (2019) Crawling the community structure of multiplex networks. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 168–175

  • Leskovec J, Mcauley JJ (2012) Learning to discover social circles in ego networks. In: Advances in neural information processing systems, pp 539–547

  • Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123

  • Li G, Chen S, Feng J, Tan K-l, Li W-s (2014) Efficient location-aware influence maximization. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, pp 87–98

  • Li R-H, Qin L, Yu JX, Mao R (2015) Influential community search in large networks. Proc VLDB Endow 8(5):509–520

  • Li Y, Zhang D, Tan K-L (2015) Real-time targeted influence maximization for online advertisements. Proc VLDB Endow 8(10):1070–1081

  • Li J, Wang X, Deng K, Yang X, Sellis T, Yu JX (2017) Most influential community search over large social networks. In: 2017 IEEE 33rd international conference on data engineering, pp 871–882

  • Lu W, Bonchi F, Goyal A, Lakshmanan LV (2013) The bang for the buck: fair competitive viral marketing from the host perspective. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 928–936

  • Mokken RJ (1979) Cliques, clubs and clans. Quality & Quantity 13(2):161–173

  • Mucha PJ, Richardson T, Macon K, Porter MA, Onnela J-P (2010) Community structure in time-dependent, multiscale, and multiplex networks. Science 328(5980):876–878

  • Reproducibility materials. http://www.cs.nthu.edu.tw/~chihya/CIW_download/, 2020

  • Seidman SB (1983) Network structure and minimum degree. Soc Netw 5(3):269–287

  • Shen C-Y, Yang D-N, Huang L-H, Lee W-C, Chen M-S (2016) Socio-spatial group queries for impromptu activity planning. IEEE Trans Knowl Data Eng 28(1):196–210

  • Shen C-Y, Huang L-H, Yang D-N, Shuai H-H, Lee W-C, Chen M-S (2017) On finding socially tenuous groups for online social networks. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 415–424

  • Shen C-Y, Fotsing CPK, Yang D-N, Chen Y-S, Lee W-C (2018) On organizing online soirees with live multi-streaming. In: AAAI conference on artificial intelligence

  • Shin K, Eliassi-Rad T, Faloutsos C (2016) Corescope: Graph mining using k-core analysis—patterns, anomalies and algorithms. In: 2016 IEEE 16th international conference on data mining, pp 469–478

  • Shuai H-H, Yang D-N, Yu PS, Chen M-S (2013) Willingness optimization for social group activity. Proc VLDB Endow 7(4):253–264

  • Song C, Hsu W, Lee ML (2017) Temporal influence blocking: Minimizing the effect of misinformation in social networks. In: 2017 IEEE 33rd international conference on data engineering, pp 847–858

  • Wang K, Cao X, Lin X, Zhang W, Qin L (2018) Efficient computing of radius-bounded k-cores. In: 2018 IEEE 34th international conference on data engineering (ICDE), pp 233–244

  • Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213

  • Yang D-N, Shen C-Y, Lee W-C, Chen M-S (2012) On socio-spatial group query for location-based social networks. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12, pp 949–957

  • Yang D-N, Hung H-J, Lee W-C, Chen W (2013) Maximizing acceptance probability for active friending in online social networks. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 713–721

  • Yang Y, Mao X, Pei J, He X (2016) Continuous influence maximization: What discounts should we offer to social network users? In: Proceedings of the 2016 international conference on management of data, pp 727–741

  • Ye S, Lang J, Wu F (2010) Crawling online social graphs. In: 2010 12th international Asia-Pacific web conference, pp 236–242

  • Zhang Y, Parthasarathy S (2012) Extracting, analyzing and visualizing triangle k-core motifs within networks. In: 2012 IEEE 28th international conference on data engineering, pp 1049–1060

  • Zhang F, Zhang W, Zhang Y, Qin L, Lin X (2017) Olak: an efficient algorithm to prevent unraveling in social networks. Proc VLDB Endow 10(6):649–660

  • Zhang F, Zhang Y, Qin L, Zhang W, Lin X (2017) When engagement meets similarity: efficient (k, r)-core computation on social networks. Proc VLDB Endow 10(10):998–1009

  • Zhu Q, Hu H, Xu C, Xu J, Lee W-C (2017) Geo-social group queries with minimum acquaintance constraints. VLDB J 26(5):709–727

Acknowledgements

This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 109-2636-E-007-019 and MOST 108-2218-E-468-002.

Author information

Corresponding author

Correspondence to Chih-Ya Shen.

Additional information

Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 335 KB)

About this article

Cite this article

Hsu, BY., Tu, CL., Chang, MY. et al. CrawlSN: community-aware data acquisition with maximum willingness in online social networks. Data Min Knowl Disc 34, 1589–1620 (2020). https://doi.org/10.1007/s10618-020-00709-5

  • DOI: https://doi.org/10.1007/s10618-020-00709-5
