A Dirichlet process biterm-based mixture model for short text stream clustering

Chen, Junyang; Gong, Zhiguo; Liu, Weiwen

doi:10.1007/s10489-019-01606-1

Published: 01 February 2020

Applied Intelligence, Volume 50, pages 1609–1619 (2020)

Abstract

Short text stream clustering has become an important problem for mining textual data on diverse social media platforms (e.g., Twitter). However, most existing clustering methods (e.g., LDA and PLSA) assume a static corpus of long texts, and little attention has been given to short text streams. Clustering short texts is more challenging than clustering long texts because their word co-occurrence patterns are sparse. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which addresses both the topic drift problem and the sparsity problem in short text stream clustering. DP-BMM has two major advantages: (1) it explicitly exploits the word pairs (biterms) constructed from each document to enhance the word co-occurrence pattern in short texts; (2) it handles the topic drift of short text streams naturally. We further propose an improved variant with a forgetting property, DP-BMM-FP, which efficiently deletes the biterms of outdated documents by deleting the clusters of outdated batches. For inference, we adopt an online Gibbs sampling method for parameter estimation. Extensive experiments on real-world datasets show that DP-BMM and DP-BMM-FP outperform state-of-the-art methods in terms of the NMI metric.
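
As a rough illustration of two ideas named in the abstract, the sketch below builds the word pairs (biterms) of a single short document and computes a normalized mutual information (NMI) score between two clusterings. It is not the authors' implementation: the function names `biterms` and `nmi` are hypothetical, every pair of tokens in a document is assumed to form one unordered biterm, and the NMI is assumed to be normalized by the geometric mean of the two entropies, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def biterms(tokens):
    """Return every unordered word pair of one short document.

    Assumption: the whole document is a single context window, so each
    pair of token positions contributes one biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two flat clusterings.

    Assumption: normalization by sqrt(H(true) * H(pred)).
    """
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label counts.
    mutual_info = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    # Entropy of each clustering.
    h_true = -sum((c / n) * log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * log(c / n) for c in count_pred.values())

    # Degenerate case: at least one side puts everything in one cluster.
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)


if __name__ == "__main__":
    print(biterms(["short", "text", "stream"]))
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

A document of n tokens yields n(n-1)/2 biterms under this assumption, which is how word-pair construction densifies the co-occurrence statistics of a short text. NMI is invariant to relabeling of clusters, so the second call scores a perfect match even though the label values differ.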

Acknowledgments

This work was supported by the Science and Technology Development Fund, Macau SAR (SKL-IOTSC-2018-2020, FDCT/0045/2019/A1, FDCT/007/2016/AFJ), Guangzhou Science and Technology Innovation and Development Commission (EF005/FST-GZG/2019/GSTIC), Research Committee of University of Macau (MYRG2017-00212-FST, MYRG2018-00129-FST).

Author information

Authors and Affiliations

State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macao, People’s Republic of China
Junyang Chen & Zhiguo Gong
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, People’s Republic of China
Weiwen Liu

Corresponding author

Correspondence to Zhiguo Gong.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, J., Gong, Z. & Liu, W. A Dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 50, 1609–1619 (2020). https://doi.org/10.1007/s10489-019-01606-1

Published: 01 February 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10489-019-01606-1
