Abstract
This study proposes a model for characterizing online communities by combining two types of data: network data and user-generated content (UGC). Existing models for detecting the community structure of a network use only network information. However, not all people connected in a network share the same interests. For instance, students who belong to the same “school” community may have different hobbies, such as music, books, or sports. Hence, it is more realistic, and more useful for companies, to identify communities according to the interests that members reveal through their communications on social media. In addition, people may belong to multiple communities, such as family, work, and online friends. Our model identifies multiple overlapping communities according to topics inferred jointly from the two types of data. To validate the main features of the proposed model, a simulation study shows that the model correctly identifies community structure that cannot be found without considering both network data and UGC. Furthermore, an empirical analysis using Twitter data shows that the model can find realistic and meaningful community structures in large online networks and has good predictive performance.
Notes
We confirmed that the majority of users posted about Nintendo Direct, a presentation of new game software, in March 2018. Hence, to avoid the effect of text commonly posted by many users, we limited the data period to end on February 28, 2018.
Acknowledgements
Igarashi acknowledges a grant by JSPS KAKENHI 18J20698. Terui acknowledges a grant by JSPS KAKENHI (A) 17H01001.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
1.1 Appendix 1: Derivation of the collapsed Gibbs sampler for the proposed model
In Sect. 3.2, we derived the conditional posterior distributions of latent variables (Equations (4) and (5)). To derive these posteriors, we need the full conditional posterior distributions for model parameters, and these are given as follows:
where \(N_{ik}\) is the number of times node i is assigned to community k on the edges from node i to other nodes and from other nodes to node i; \(M_{ik}\) is the number of words in node i’s document that are assigned to community k; \(n_{kk'}^{(+)}\) (\(n_{kk'}^{(-)}\)) is the number of links (non-links) from nodes in community k to nodes in community \(k'\); \(M_{kl}\) is the number of words assigned to both community k and topic l; and \(M_{lv}\) is the number of times word v is assigned to topic l. \(\varGamma \) is the gamma function, and \({\mathbb {I}}\) is the indicator function, which returns 1 if the condition is satisfied and 0 otherwise.
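The count statistics defined above are what a collapsed sampler needs to recover point estimates of the model parameters. The following is a minimal sketch, not the authors' implementation: it assumes symmetric Dirichlet priors and a Beta prior on the link probabilities, with hypothetical hyperparameter names `gamma`, `a`, `b`, `alpha`, and `beta` that do not appear in this text, and it assumes that both \(N_{ik}\) and \(M_{ik}\) update the same community-membership vector of node i. The function name and argument names are likewise illustrative.

```python
import numpy as np


def _normalize_rows(x):
    """Divide each row by its sum so that rows are probability vectors."""
    return x / x.sum(axis=1, keepdims=True)


def posterior_means(N_ik, M_ik, n_pos, n_neg, M_kl, M_lv,
                    gamma=0.1, a=1.0, b=1.0, alpha=0.1, beta=0.01):
    """Posterior-mean estimates from count statistics (conjugate updates).

    N_ik  : (nodes, communities) community counts from edge assignments
    M_ik  : (nodes, communities) community counts from word assignments
    n_pos : (communities, communities) link counts between communities
    n_neg : (communities, communities) non-link counts between communities
    M_kl  : (communities, topics) community-topic word counts
    M_lv  : (topics, vocabulary) topic-word counts
    The hyperparameters are assumptions made for this sketch.
    """
    N_ik, M_ik, n_pos, n_neg, M_kl, M_lv = (
        np.asarray(x, dtype=float)
        for x in (N_ik, M_ik, n_pos, n_neg, M_kl, M_lv)
    )
    # Community membership of each node: Dirichlet mean with pooled counts.
    eta = _normalize_rows(N_ik + M_ik + gamma)
    # Link probability between community pairs: Beta mean.
    psi = (n_pos + a) / (n_pos + n_neg + a + b)
    # Topic distribution of each community: Dirichlet mean.
    theta = _normalize_rows(M_kl + alpha)
    # Word distribution of each topic: Dirichlet mean.
    phi = _normalize_rows(M_lv + beta)
    return eta, psi, theta, phi
```

In practice such estimates would be computed at each retained iteration and then averaged over iterations; that averaging step is omitted here.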
Collapsed Gibbs sampling repeats the sampling procedure according to Equations (4) and (5). The pseudocode for the proposed model is provided in Algorithm 1.
1.2 Appendix 2: Definition of WAIC for the proposed model
The definition of WAIC for our model is as follows:
where \(P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \) and \(P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \) are the model likelihoods conditional on the parameters estimated from the samples at the gth iteration.
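As an illustration of how such a criterion can be computed from sampler output, the sketch below evaluates WAIC from a matrix of pointwise log-likelihoods, with one row per retained iteration g and one column per observation (here the columns would be the link indicators \(a_{ij}\) followed by the words \(w_{im}\)). It uses the common deviance-scale form, \(-2\) times the log pointwise predictive density minus the sum of posterior variances of the pointwise log-likelihood; the scaling and the function name `waic` are assumptions of this sketch and may differ from the definition used by the authors.

```python
import numpy as np


def waic(log_lik):
    """WAIC on the deviance scale from pointwise log-likelihoods.

    log_lik : array of shape (G, n), where entry (g, i) is the
              log-likelihood of observation i under the parameters
              sampled at iteration g. Requires G >= 2.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    # Log of the mean likelihood over iterations, computed stably by
    # subtracting the column-wise maximum before exponentiating.
    col_max = log_lik.max(axis=0)
    lppd = np.sum(col_max + np.log(np.mean(np.exp(log_lik - col_max), axis=0)))
    # Effective number of parameters: summed sample variance over iterations.
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

Lower values indicate better expected predictive performance under this convention, so candidate models can be compared by computing `waic` on each model's log-likelihood matrix for the same data.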
Cite this article
Igarashi, M., Terui, N. Characterization of topic-based online communities by combining network data and user generated content. Stat Comput 30, 1309–1324 (2020). https://doi.org/10.1007/s11222-020-09947-5
DOI: https://doi.org/10.1007/s11222-020-09947-5