Characterization of topic-based online communities by combining network data and user generated content

Igarashi, Mirai; Terui, Nobuhiko

doi:10.1007/s11222-020-09947-5

Published: 22 May 2020

Volume 30, pages 1309–1324, (2020)

Statistics and Computing

Abstract

This study proposes a model for characterizing online communities by combining two types of data: network data and user-generated content (UGC). Existing models for detecting the community structure of a network employ only network information. However, not all people connected in a network share the same interests. For instance, even if students belong to the same “school” community, they may have various hobbies such as music, books, or sports. Hence, it is more realistic, and more beneficial for companies, to identify communities according to the interests that users reveal through their communications on social media. In addition, people may belong to multiple communities such as family, work, and online friends. Our model explores multiple overlapping communities according to their topics, which are identified by using the two types of data jointly. To validate the main features of the proposed model, our simulation study shows that the model correctly identifies community structure that could not be found without considering both network data and UGC. Furthermore, an empirical analysis using Twitter data shows that our model can find realistic and meaningful community structures in large online networks and has good predictive performance.


Notes

In this study, we use the scale of Gelman et al. (2013), which is $-2n$ times the original definition of Watanabe (2010), where n is the number of data points. This scale allows comparison with other information criteria such as AIC and DIC.
We confirmed that the majority of users posted about Nintendo Direct, a presentation of new game software, in March 2018. Hence, to avoid the effect of such text commonly posted by many users, we limited the data period to end on February 28, 2018.

References

Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9(SEP), 1981–2014 (2008)
Ansari, A., Stahl, F., Heitmann, M., Bremer, L.: Building a social network for success. J. Mark. Res. 55(3), 321–338 (2018)
Barbillon, P., Donnet, S., Lazega, E., Bar-Hen, A.: Stochastic block models for multiplex networks: an application to a multilevel network of researchers. J. R. Stat. Soc. Ser. A Stat. Soc. 180(1), 295–314 (2017)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5), 993–1022 (2003)
Bouveyron, C., Latouche, P., Zreik, R.: The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat. Comput. 28(1), 11–31 (2018)
Chang, J., Blei, D.M.: Hierarchical relational models for document networks. Ann. Appl. Stat. 4(1), 124–150 (2010)
Chen, K., Lei, J.: Network cross-validation for determining the number of communities in network data. J. Am. Stat. Assoc. 113(521), 241–251 (2018)
Daudin, J.-J., Picard, F., Robin, S.: A mixture model for random graphs. Stat. Comput. 18(2), 173–183 (2008)
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC, New York (2013)
Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, pp. 377–384, ACM Press, New York, USA, (2006)
Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. 101(Supplement 1), 5228–5235 (2004)
Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. J. R. Stat. Soc. Ser. A Stat. Soc. 170(2), 301–354 (2007)
Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Jeong, H., Mason, S.P., Barabási, A.-L., Oltvai, Z.N.: Lethality and centrality in protein networks. Nature 411(6833), 41–42 (2001)
Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 83(1), 1–10 (2011)
Krebs, V.E.: Mapping networks of terrorist cells. Connections 24(3), 43–52 (2002)
Krivitsky, P.N., Handcock, M.S., Raftery, A.E., Hoff, P.D.: Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Soc. Netw. 31(3), 204–213 (2009)
Latouche, P., Birmelé, E., Ambroise, C.: Variational Bayesian inference and complexity control for stochastic block models. Stat. Model. Int. J. 12(1), 93–115 (2012)
Latouche, P., Birmelé, E., Ambroise, C.: Overlapping stochastic block models with application to the French political blogosphere. Ann. Appl. Stat. 5(1), 309–336 (2011)
Liu, X., Bollen, J., Nelson, M.L., Van de Sompel, H.: Co-authorship networks in the digital library research community. Inf. Process. Manag. 41(6), 1462–1480 (2005)
Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th International Conference On Machine Learning, ICML 2009, pp. 665–672 (2009)
Matias, C., Miele, V.: Statistical clustering of temporal networks through a dynamic stochastic block model. J. R. Stat. Soc. Ser. B Stat. Methodol. 79(4), 1119–1141 (2017)
McDaid, A.F., Murphy, T.B., Friel, N., Hurley, N.J.: Improved Bayesian inference for the stochastic block model with application to large networks. Comput. Stat. Data Anal. 60, 12–31 (2013)
Muller, E., Peres, R.: The effect of social networks structure on innovation performance: a review and directions for research. Int. J. Res. Mark. 36(1), 3–19 (2019)
Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)
Nowicki, K., Snijders, T.A.B.: Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)
Pathak, N., DeLong, C., Banerjee, A., Erickson, K.: Social Topic Models for Community Extraction. In: SNA-KDD workshop, p. 2008, (2008)
Peng, J., Agarwal, A., Hosanagar, K., Iyengar, R.: Network overlap and content sharing on social media platforms. J. Mark. Res. 55(4), 571–585 (2018)
Peres, R.: The impact of network characteristics on the diffusion of innovations. Phys. A Stat. Mech. Appl. 402, 330–343 (2014)
Saldaña, D.F., Yi, Y., Feng, Y.: How many communities are there? J. Comput. Gr. Stat. 26(1), 171–181 (2017)
Snijders, T.A.B., Nowicki, K.: Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classif. 14(1), 75–100 (1997)
Wang, Y.J., Wong, G.Y.: Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 82(397), 8–19 (1987)
Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)
Xing, E.P., Fu, W., Song, L.: A state-space mixed membership blockmodel for dynamic network tomography. Ann. Appl. Stat. 4(2), 535–566 (2010)
Xu, K.S., Hero, A.O.: Dynamic stochastic blockmodels for time-evolving social networks. IEEE J. Sel. Top. Signal Process. 8(4), 552–562 (2014)
Zanghi, H., Volant, S., Ambroise, C.: Clustering based on random graph model embedding vertex features. Pattern Recognit. Lett. 31(9), 830–836 (2010)
Zhu, Y., Yan, X., Getoor, L., Moore, C.: Scalable text and link analysis with mixed-Topic link models. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473–481, (2013)

Acknowledgements

Igarashi acknowledges a grant by JSPS KAKENHI 18J20698. Terui acknowledges a grant by JSPS KAKENHI (A) 17H01001.

Author information

Authors and Affiliations

Tohoku University, Sendai-shi, Miyagi, Japan
Mirai Igarashi & Nobuhiko Terui

Corresponding author

Correspondence to Mirai Igarashi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 154 KB)

Appendix

1.1 Appendix 1: Derivation of the collapsed Gibbs sampler for the proposed model

In Sect. 3.2, we derived the conditional posterior distributions of latent variables (Equations (4) and (5)). To derive these posteriors, we need the full conditional posterior distributions for model parameters, and these are given as follows:

$$\begin{aligned}&P(\eta _i | S, R, X, \gamma ) \nonumber \\&\quad = \frac{\varGamma \left( \sum _k \left( N_{ik} + M_{ik} + \gamma _k \right) \right) }{\prod _k \varGamma (N_{ik} + M_{ik} + \gamma _k)} \prod _{k=1}^{K} \eta _{ik}^{N_{ik} + M_{ik} + \gamma _k - 1} \end{aligned}$$

(7)

$$\begin{aligned}&P(\psi _{kk'} | A, S, R, \delta , \epsilon ) \nonumber \\&\quad = \frac{\varGamma (n_{kk'}^{(+)} + n_{kk'}^{(-)} + \delta _{kk'} + \epsilon _{kk'})}{\varGamma (n_{kk'}^{(+)} + \delta _{kk'}) \varGamma (n_{kk'}^{(-)} + \epsilon _{kk'})} \nonumber \\&\qquad \times \psi _{kk'}^{n_{kk'}^{(+)} + \delta _{kk'} - 1} (1 - \psi _{kk'})^{n_{kk'}^{(-)} + \epsilon _{kk'} - 1} \end{aligned}$$

(8)

$$\begin{aligned}&P(\theta _k | X, Z, \alpha ) = \frac{\varGamma \left( \sum _l \left( M_{kl} + \alpha _l \right) \right) }{\prod _l \varGamma (M_{kl} + \alpha _l)} \prod _{l=1}^{L} \theta _{kl}^{M_{kl} + \alpha _l - 1} \end{aligned}$$

(9)

$$\begin{aligned}&P(\phi _l | W, Z, \beta ) = \frac{\varGamma \left( \sum _v \left( M_{lv} + \beta _v \right) \right) }{\prod _v \varGamma (M_{lv} + \beta _v)} \prod _{v=1}^{V} \phi _{lv}^{M_{lv} + \beta _v - 1}, \end{aligned}$$

(10)

where $N_{ik}$ is the number of times node i is assigned to community k on the edges from node i to other nodes and from other nodes to node i. $M_{ik}$ is the number of words in node i’s document that are assigned to community k. $n_{kk'}^{(+)}$ ($n_{kk'}^{(-)}$) is the number of links (non-links) from nodes in community k to nodes in community $k'$. $M_{kl}$ is the number of words assigned to community k and topic l. $M_{lv}$ is the number of times word v is assigned to topic l. $\varGamma $ is the gamma function, and ${\mathbb {I}}$ is the indicator function, which returns 1 if the condition is satisfied and 0 otherwise.

Collapsed Gibbs sampling repeats the sampling procedure according to Equations (4) and (5). Pseudocode for the proposed model is provided in Algorithm 1.
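As a rough illustration of Equations (7)–(10), the sketch below draws the model parameters from their full conditionals given the count statistics, reading Equations (7), (9), and (10) as Dirichlet distributions and Equation (8) as a Beta distribution. It is not the collapsed sampler itself, which integrates these parameters out. The function name `draw_parameters` and all argument names are hypothetical, and NumPy is an assumed choice of library.

```python
import numpy as np

def draw_parameters(N, M_node, n_pos, n_neg, M_ct, M_tw,
                    gamma, delta, eps, alpha, beta, rng):
    """Draw (eta, psi, theta, phi) given counts (hypothetical helper).

    N, M_node : (D, K) counts N_ik and M_ik
    n_pos, n_neg : (K, K) link and non-link counts n_kk'^(+), n_kk'^(-)
    M_ct : (K, L) community-topic counts M_kl
    M_tw : (L, V) topic-word counts M_lv
    gamma, delta, eps, alpha, beta : positive prior hyperparameters
    """
    # Eq. (7): eta_i ~ Dirichlet(N_i + M_i + gamma), one draw per node
    eta = np.vstack([rng.dirichlet(N[i] + M_node[i] + gamma)
                     for i in range(N.shape[0])])
    # Eq. (8): psi_kk' ~ Beta(n^(+) + delta, n^(-) + eps), elementwise
    psi = rng.beta(n_pos + delta, n_neg + eps)
    # Eq. (9): theta_k ~ Dirichlet(M_k + alpha), one draw per community
    theta = np.vstack([rng.dirichlet(M_ct[k] + alpha)
                       for k in range(M_ct.shape[0])])
    # Eq. (10): phi_l ~ Dirichlet(M_l + beta), one draw per topic
    phi = np.vstack([rng.dirichlet(M_tw[l] + beta)
                     for l in range(M_tw.shape[0])])
    return eta, psi, theta, phi
```

In a collapsed scheme such draws are typically needed only after the latent assignments have been sampled, for example to obtain the parameter values used when evaluating the likelihood.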

1.2 Appendix 2: Definition of WAIC for proposed model

The definition of WAIC for our model is as follows:

$$\begin{aligned} lpd^{(i)}&= \log \left( \frac{1}{G} \sum _{g=b+1}^{G} \prod _{j=1}^{D} P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \right. \nonumber \\&\quad \left. \prod _{m=1}^{M_i} P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \right) \end{aligned}$$

(11)

$$\begin{aligned} P_{waic}^{(i)}&= \frac{G}{G-1} \left( \frac{1}{G} \sum _{g=b+1}^{G} \left( \sum _{j=1}^{D} \log P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) ^2 \right. \right. \nonumber \\&\quad \left. + \sum _{m=1}^{M_i} \log P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) ^2\right) \nonumber \\&\quad - \left( \frac{1}{G} \sum _{g=b+1}^{G} \left( \sum _{j=1}^{D} \log P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \right. \right. \nonumber \\&\quad \left. \left. \left. + \sum _{m=1}^{M_i} \log P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) \right) \right) ^2 \right) \end{aligned}$$

(12)

$$\begin{aligned} WAIC&= -2 \sum _{i=1}^{D} \left( lpd^{(i)} - P_{waic}^{(i)} \right) , \end{aligned}$$

(13)

where $P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) $ and $P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) $ are the model likelihoods conditioned on the parameters estimated using the samples at the gth iteration:

$$\begin{aligned}&P\left( a_{ij} | H^{(g)}, \varPsi ^{(g)}\right) \nonumber \\&\quad = \sum _{k=1}^{K} \sum _{k'=1}^{K} \eta _{ik}^{(g)} \cdot \eta _{jk'}^{(g)} \cdot \left( \psi _{kk'}^{(g)}\right) ^{{\mathbb {I}}(a_{ij}=1)} \cdot \left( 1 - \psi _{kk'}^{(g)}\right) ^{{\mathbb {I}}(a_{ij}=0)} \end{aligned}$$

(14)

$$\begin{aligned}&P\left( w_{im} | H^{(g)}, \varTheta ^{(g)}, \varPhi ^{(g)}\right) = \sum _{k=1}^{K} \sum _{l=1}^{L} \eta _{ik}^{(g)} \cdot \theta _{kl}^{(g)} \cdot \phi _{lw_{im}}^{(g)}. \end{aligned}$$

(15)
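The following is a minimal sketch of how the WAIC in Equation (13) might be computed from stored posterior draws, using the likelihoods in Equations (14) and (15). It makes several assumptions: the arrays hold only the retained (post burn-in) draws and all averages are taken over them; the penalty term is computed as the sample variance, across draws, of each node's total log-likelihood (the form given by Gelman et al. 2013), which is one reading of Equation (12); and the function name `waic` and its argument layout are hypothetical.

```python
import numpy as np

def waic(A, docs, eta, psi, theta, phi):
    """WAIC on the -2 scale from retained posterior draws (hypothetical helper).

    A     : (D, D) binary adjacency matrix
    docs  : list of D integer arrays; docs[i] holds the word ids of node i
    eta   : (G, D, K) draws of community memberships
    psi   : (G, K, K) draws of link probabilities
    theta : (G, K, L) draws of community-topic distributions
    phi   : (G, L, V) draws of topic-word distributions
    Requires G >= 2 so that the sample variance is defined.
    """
    G, D, _ = eta.shape
    # Eq. (14): P(a_ij = 1) = sum_k sum_k' eta_ik * eta_jk' * psi_kk'
    p_link = np.einsum('gik,gkl,gjl->gij', eta, psi, eta)
    log_edge = np.log(np.where(A[None, :, :] == 1, p_link, 1.0 - p_link))
    # Eq. (15): P(w = v) for node i is sum_k sum_l eta_ik * theta_kl * phi_lv
    p_word = eta @ theta @ phi                      # shape (G, D, V)
    total = 0.0
    for i in range(D):
        log_w = np.log(p_word[:, i, docs[i]])       # shape (G, M_i)
        # log-likelihood of node i's edges and words, one value per draw
        ll = log_edge[:, i, :].sum(axis=1) + log_w.sum(axis=1)
        # Eq. (11): log of the average likelihood, computed stably
        m = ll.max()
        lpd = m + np.log(np.mean(np.exp(ll - m)))
        # Penalty: sample variance of ll across draws (assumed reading of Eq. (12))
        p_waic = np.var(ll, ddof=1)
        total += lpd - p_waic
    # Eq. (13)
    return -2.0 * total
```

Working with log-likelihoods and shifting by the maximum before exponentiating avoids underflow when a node has many edges or words, since the raw product in Equation (11) can be far below machine precision.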

About this article

Cite this article

Igarashi, M., Terui, N. Characterization of topic-based online communities by combining network data and user generated content. Stat Comput 30, 1309–1324 (2020). https://doi.org/10.1007/s11222-020-09947-5

Received: 25 July 2019
Accepted: 10 May 2020
Published: 22 May 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11222-020-09947-5
