Characterization of topic-based online communities by combining network data and user generated content

Statistics and Computing

Abstract

This study proposes a model for characterizing online communities by combining two types of data: network data and user-generated content (UGC). Existing models for detecting the community structure of a network employ only network information. However, not all people connected in a network share the same interests. For instance, even if students belong to the same “school” community, they may have various hobbies such as music, books, or sports. Hence, it is more realistic, and more beneficial for companies, to identify communities according to the interests that members reveal through their communications on social media. In addition, people may belong to multiple communities such as family, work, and online friends. Our model explores multiple overlapping communities according to their topics, which are identified by using the two types of data jointly. To validate the main features of the proposed model, our simulation study shows that the model correctly identifies community structure that could not be found without considering both network data and UGC. Furthermore, an empirical analysis using Twitter data shows that our model can find realistic and meaningful community structures in large online networks and has good predictive performance.

Notes

  1. In this study, we use the scale of Gelman et al. (2013), which is \(-2n\) times the original definition of Watanabe (2010), where n is the number of data points. This scale allows comparison with other information criteria such as AIC and DIC.

  2. We confirmed that the majority of users posted about Nintendo Direct, a presentation of new game software, in March 2018. Hence, to avoid the effect of text that was commonly posted by many users, we restricted the data period to end on February 28, 2018.

References

  • Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9(SEP), 1981–2014 (2008)

  • Ansari, A., Stahl, F., Heitmann, M., Bremer, L.: Building a social network for success. J. Mark. Res. 55(3), 321–338 (2018)

  • Barbillon, P., Donnet, S., Lazega, E., Bar-Hen, A.: Stochastic block models for multiplex networks: an application to a multilevel network of researchers. J. R. Stat. Soc. Ser. A Stat. Soc. 180(1), 295–314 (2017)

  • Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5), 993–1022 (2003)

  • Bouveyron, C., Latouche, P., Zreik, R.: The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat. Comput. 28(1), 11–31 (2018)

  • Chang, J., Blei, D.M.: Hierarchical relational models for document networks. Ann. Appl. Stat. 4(1), 124–150 (2010)

  • Chen, K., Lei, J.: Network cross-validation for determining the number of communities in network data. J. Am. Stat. Assoc. 113(521), 241–251 (2018)

  • Daudin, J.-J., Picard, F., Robin, S.: A mixture model for random graphs. Stat. Comput. 18(2), 173–183 (2008)

  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC, New York (2013)

  • Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, pp. 377–384. ACM Press, New York (2006)

  • Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. 101(Supplement 1), 5228–5235 (2004)

  • Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. J. R. Stat. Soc. Ser. A Stat. Soc. 170(2), 301–354 (2007)

  • Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)

  • Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

  • Jeong, H., Mason, S.P., Barabási, A.-L., Oltvai, Z.N.: Lethality and centrality in protein networks. Nature 411(6833), 41–42 (2001)

  • Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 83(1), 1–10 (2011)

  • Krebs, V.E.: Mapping networks of terrorist cells. Connections 24(3), 43–52 (2002)

  • Krivitsky, P.N., Handcock, M.S., Raftery, A.E., Hoff, P.D.: Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Soc. Netw. 31(3), 204–213 (2009)

  • Latouche, P., Birmelé, E., Ambroise, C.: Variational Bayesian inference and complexity control for stochastic block models. Stat. Model. Int. J. 12(1), 93–115 (2012)

  • Latouche, P., Birmelé, E., Ambroise, C.: Overlapping stochastic block models with application to the French political blogosphere. Ann. Appl. Stat. 5(1), 309–336 (2011)

  • Liu, X., Bollen, J., Nelson, M.L., Van de Sompel, H.: Co-authorship networks in the digital library research community. Inf. Process. Manag. 41(6), 1462–1480 (2005)

  • Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th International Conference on Machine Learning, ICML 2009, pp. 665–672 (2009)

  • Matias, C., Miele, V.: Statistical clustering of temporal networks through a dynamic stochastic block model. J. R. Stat. Soc. Ser. B Stat. Methodol. 79(4), 1119–1141 (2017)

  • McDaid, A.F., Murphy, T.B., Friel, N., Hurley, N.J.: Improved Bayesian inference for the stochastic block model with application to large networks. Comput. Stat. Data Anal. 60, 12–31 (2013)

  • Muller, E., Peres, R.: The effect of social networks structure on innovation performance: a review and directions for research. Int. J. Res. Mark. 36(1), 3–19 (2019)

  • Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)

  • Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)

  • Nowicki, K., Snijders, T.A.B.: Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)

  • Pathak, N., DeLong, C., Banerjee, A., Erickson, K.: Social topic models for community extraction. In: SNA-KDD Workshop (2008)

  • Peng, J., Agarwal, A., Hosanagar, K., Iyengar, R.: Network overlap and content sharing on social media platforms. J. Mark. Res. 55(4), 571–585 (2018)

  • Peres, R.: The impact of network characteristics on the diffusion of innovations. Phys. A Stat. Mech. Appl. 402, 330–343 (2014)

  • Saldaña, D.F., Yi, Y., Feng, Y.: How many communities are there? J. Comput. Gr. Stat. 26(1), 171–181 (2017)

  • Snijders, T.A.B., Nowicki, K.: Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classif. 14(1), 75–100 (1997)

  • Wang, Y.J., Wong, G.Y.: Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 82(397), 8–19 (1987)

  • Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)

  • Xing, E.P., Fu, W., Song, L.: A state-space mixed membership blockmodel for dynamic network tomography. Ann. Appl. Stat. 4(2), 535–566 (2010)

  • Xu, K.S., Hero, A.O.: Dynamic stochastic blockmodels for time-evolving social networks. IEEE J. Sel. Top. Signal Process. 8(4), 552–562 (2014)

  • Zanghi, H., Volant, S., Ambroise, C.: Clustering based on random graph model embedding vertex features. Pattern Recognit. Lett. 31(9), 830–836 (2010)

  • Zhu, Y., Yan, X., Getoor, L., Moore, C.: Scalable text and link analysis with mixed-topic link models. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473–481 (2013)

Acknowledgements

Igarashi acknowledges a grant by JSPS KAKENHI 18J20698. Terui acknowledges a grant by JSPS KAKENHI (A) 17H01001.

Author information

Corresponding author

Correspondence to Mirai Igarashi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 154 KB)

Appendix

Appendix 1: Derivation of the collapsed Gibbs sampler for the proposed model

In Sect. 3.2, we derived the conditional posterior distributions of latent variables (Equations (4) and (5)). To derive these posteriors, we need the full conditional posterior distributions for model parameters, and these are given as follows:

$$\begin{aligned}&P(\eta _i | S, R, X, \gamma ) \nonumber \\&\quad = \frac{\varGamma \left( \sum _k \left( N_{ik} + M_{ik} + \gamma _k \right) \right) }{\prod _k \varGamma (N_{ik} + M_{ik} + \gamma _k)} \prod _{k=1}^{K} \eta _{ik}^{N_{ik} + M_{ik} + \gamma _k - 1} \end{aligned}$$
(7)
$$\begin{aligned}&P(\psi _{kk'} | A, S, R, \delta , \epsilon ) \nonumber \\&\quad = \frac{\varGamma (n_{kk'}^{(+)} + n_{kk'}^{(-)} + \delta _{kk'} + \epsilon _{kk'})}{\varGamma (n_{kk'}^{(+)} + \delta _{kk'}) \varGamma (n_{kk'}^{(-)} + \epsilon _{kk'})} \nonumber \\&\qquad \times \psi _{kk'}^{n_{kk'}^{(+)} + \delta _{kk'} - 1} (1 - \psi _{kk'})^{n_{kk'}^{(-)} + \epsilon _{kk'} - 1} \end{aligned}$$
(8)
$$\begin{aligned}&P(\theta _k | X, Z, \alpha ) = \frac{\varGamma \left( \sum _l \left( M_{kl} + \alpha _l \right) \right) }{\prod _l \varGamma (M_{kl} + \alpha _l)} \prod _{l=1}^{L} \theta _{kl}^{M_{kl} + \alpha _l - 1} \end{aligned}$$
(9)
$$\begin{aligned}&P(\phi _l | W, Z, \beta ) = \frac{\varGamma \left( \sum _v \left( M_{lv} + \beta _v \right) \right) }{\prod _v \varGamma (M_{lv} + \beta _v)} \prod _{v=1}^{V} \phi _{lv}^{M_{lv} + \beta _v - 1}, \end{aligned}$$
(10)

where \(N_{ik}\) is the number of times node i is assigned to community k on the edges from node i to other nodes and from other nodes to node i; \(M_{ik}\) is the number of words in node i’s document that are assigned to community k; \(n_{kk'}^{(+)}\) (\(n_{kk'}^{(-)}\)) is the number of links (non-links) from nodes in community k to nodes in community \(k'\); \(M_{kl}\) is the number of words assigned to both community k and topic l; and \(M_{lv}\) is the number of times word v is assigned to topic l. \(\varGamma \) is the gamma function, and \({\mathbb {I}}\) is the indicator function, which returns 1 if its condition is satisfied and 0 otherwise.
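As an illustration only, the following Python sketch draws one set of parameters from the full conditionals, reading Equations (7), (9), and (10) as Dirichlet densities and Equation (8) as a Beta density. The function name `sample_parameters`, the toy sizes, the randomly generated count arrays, and the hyperparameter values are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def sample_parameters(N, M_node, n_pos, n_neg, M_kl, M_lv,
                      gamma, delta, eps, alpha, beta, rng):
    """Draw (eta, psi, theta, phi) from Dirichlet/Beta full conditionals."""
    # Eq. (7): eta_i ~ Dirichlet(N_i + M_i + gamma), one row per node.
    eta = np.vstack([rng.dirichlet(row) for row in N + M_node + gamma])
    # Eq. (8): psi_kk' ~ Beta(n+ + delta, n- + eps), elementwise over (k, k').
    psi = rng.beta(n_pos + delta, n_neg + eps)
    # Eq. (9): theta_k ~ Dirichlet(M_k. + alpha), one row per community.
    theta = np.vstack([rng.dirichlet(row) for row in M_kl + alpha])
    # Eq. (10): phi_l ~ Dirichlet(M_l. + beta), one row per topic.
    phi = np.vstack([rng.dirichlet(row) for row in M_lv + beta])
    return eta, psi, theta, phi


# Toy sizes and random counts (hypothetical, for illustration only).
rng = np.random.default_rng(0)
D, K, L, V = 4, 3, 2, 5  # nodes, communities, topics, vocabulary size
eta, psi, theta, phi = sample_parameters(
    N=rng.integers(0, 6, (D, K)), M_node=rng.integers(0, 6, (D, K)),
    n_pos=rng.integers(0, 5, (K, K)), n_neg=rng.integers(0, 20, (K, K)),
    M_kl=rng.integers(0, 10, (K, L)), M_lv=rng.integers(0, 10, (L, V)),
    gamma=np.full(K, 0.5), delta=1.0, eps=1.0,
    alpha=np.full(L, 0.5), beta=np.full(V, 0.5), rng=rng)
```

Each row of `eta`, `theta`, and `phi` lies on a probability simplex, and each entry of `psi` is a link probability between a pair of communities. In a collapsed sampler these parameters are integrated out during sampling, so a draw of this kind would typically be used only to recover parameter estimates from the sampled assignments.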

Collapsed Gibbs sampling repeats the sampling procedure according to Equations (4) and (5). Pseudocode for the proposed model is provided in Algorithm 1.

[Algorithm 1: pseudocode of the collapsed Gibbs sampler for the proposed model]

Appendix 2: Definition of WAIC for the proposed model

The definition of WAIC for our model is as follows:

$$\begin{aligned} lpd^{(i)}&= \log \left( \frac{1}{G} \sum _{g=b+1}^{G} \prod _{j=1}^{D} P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \right. \nonumber \\&\quad \left. \prod _{m=1}^{M_i} P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \right) \end{aligned}$$
(11)
$$\begin{aligned} P_{waic}^{(i)}&= \frac{G}{G-1} \left( \frac{1}{G} \sum _{g=b+1}^{G} \left( \sum _{j=1}^{D} \left( \log P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \right) ^2 \right. \right. \nonumber \\&\quad \left. + \sum _{m=1}^{M_i} \left( \log P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \right) ^2\right) \nonumber \\&\quad - \left( \frac{1}{G} \sum _{g=b+1}^{G} \left( \sum _{j=1}^{D} \log P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \right. \right. \nonumber \\&\quad \left. \left. \left. + \sum _{m=1}^{M_i} \log P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \right) \right) ^2 \right) \end{aligned}$$
(12)
$$\begin{aligned} WAIC&= -2 \sum _{i=1}^{D} \left( lpd^{(i)} - P_{waic}^{(i)} \right) , \end{aligned}$$
(13)

where \(P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \) and \(P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \) are the model likelihoods conditional on the parameters estimated from the samples at the gth iteration:

$$\begin{aligned}&P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \nonumber \\&\quad = \sum _{k=1}^{K} \sum _{k'=1}^{K} \eta _{ik}^{(g)} \cdot \eta _{jk'}^{(g)} \cdot \left( \psi _{kk'}^{(g)}\right) ^{{\mathbb {I}}(a_{ij}=1)} \cdot \left( 1 - \psi _{kk'}^{(g)}\right) ^{{\mathbb {I}}(a_{ij}=0)} \end{aligned}$$
(14)
$$\begin{aligned}&P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) = \sum _{k=1}^{K} \sum _{l=1}^{L} \eta _{ik}^{(g)} \cdot \theta _{kl}^{(g)} \cdot \phi _{lw_{im}}^{(g)}. \end{aligned}$$
(15)
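The computation can be sketched in Python as follows, under stated assumptions. The per-edge and per-word likelihoods follow Equations (14) and (15). The penalty term is computed here as the sample variance, over retained draws, of each node's total log-likelihood, which is the form given by Gelman et al. (2013); whether this matches Equation (12) term by term is an assumption of this sketch. The function name `waic`, the array layout (draws on the first axis, burn-in already discarded), and the toy data are hypothetical.

```python
import numpy as np


def waic(A, docs, eta, psi, theta, phi):
    """WAIC on the -2 scale from G >= 2 retained posterior draws.

    A: (D, D) binary adjacency matrix; docs: list of D word-index lists.
    eta: (G, D, K), psi: (G, K, K), theta: (G, K, L), phi: (G, L, V).
    """
    G, D, _ = eta.shape
    ll = np.empty((G, D))  # ll[g, i]: log-likelihood of node i under draw g
    for g in range(G):
        # Eq. (14): the double sum over (k, k') is a bilinear form in eta.
        p_link = eta[g] @ psi[g] @ eta[g].T
        p_edge = np.where(A == 1, p_link, 1.0 - p_link)
        # Eq. (15): word probabilities mixed over communities and topics.
        p_word = eta[g] @ theta[g] @ phi[g]
        ll[g] = np.log(p_edge).sum(axis=1)
        for i, words in enumerate(docs):
            idx = np.asarray(words, dtype=int)
            ll[g, i] += np.log(p_word[i, idx]).sum()
    # Eq. (11): log of the average likelihood, computed stably in log space.
    m = ll.max(axis=0)
    lpd = m + np.log(np.exp(ll - m).mean(axis=0))
    # Penalty: sample variance of the node-level log-likelihood over draws.
    p_waic = ll.var(axis=0, ddof=1)
    # Eq. (13): sum over nodes on the -2 scale.
    return -2.0 * np.sum(lpd - p_waic)


# Toy posterior draws and data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
G, D, K, L, V = 5, 4, 3, 2, 6
A = rng.integers(0, 2, (D, D))
docs = [[0, 1, 1], [2, 5], [3], [4, 4, 0, 2]]
eta = rng.dirichlet(np.ones(K), size=(G, D))
psi = rng.beta(1.0, 1.0, (G, K, K))
theta = rng.dirichlet(np.ones(L), size=(G, K))
phi = rng.dirichlet(np.ones(V), size=(G, L))
value = waic(A, docs, eta, psi, theta, phi)
```

Averaging in log space avoids underflow when the product over edges and words in Equation (11) becomes very small for large networks or long documents.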

About this article

Cite this article

Igarashi, M., Terui, N. Characterization of topic-based online communities by combining network data and user generated content. Stat Comput 30, 1309–1324 (2020). https://doi.org/10.1007/s11222-020-09947-5

  • DOI: https://doi.org/10.1007/s11222-020-09947-5
