A Dirichlet process biterm-based mixture model for short text stream clustering

Applied Intelligence

Abstract

Short text stream clustering has become an important problem for mining textual data on diverse social media platforms (e.g., Twitter). However, most existing clustering methods (e.g., LDA and PLSA) assume a static corpus of long texts, and little attention has been given to short text streams. Unlike long texts, short texts are more challenging to cluster because their word co-occurrence patterns are sparse. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which addresses both the topic drift problem and the sparsity problem in short text stream clustering. DP-BMM has two major advantages: (1) it explicitly exploits the word pairs (biterms) constructed from each document to enhance the word co-occurrence pattern in short texts; (2) it handles the topic drift of short text streams naturally. We further propose an improved variant of DP-BMM with a forgetting property, called DP-BMM-FP, which efficiently deletes the biterms of outdated documents by deleting the clusters of outdated batches. For inference, we adopt an online Gibbs sampling method for parameter estimation. Extensive experiments on real-world datasets show that DP-BMM and DP-BMM-FP outperform state-of-the-art methods in terms of normalized mutual information (NMI).
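To make the notion of word pairs constructed from each document concrete, the following is a minimal Python sketch of biterm extraction. It assumes the usual biterm definition from the biterm topic model literature (every unordered pair of tokens in one short document) and treats the whole document as a single context window. The function name `extract_biterms` and the example tokens are hypothetical, and the sketch is not taken from the article's own implementation.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short document.

    Assumption: the whole document is one context window, so any two
    token positions form a biterm. Each pair is sorted so that
    ("new", "phone") and ("phone", "new") map to the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical tokenized short text (e.g., a news title after preprocessing).
doc = ["apple", "releases", "new", "phone"]
biterms = extract_biterms(doc)

# A document with n tokens yields n * (n - 1) / 2 biterms.
print(len(biterms))
for biterm in biterms:
    print(biterm)
```

A document of four tokens yields six biterms, which illustrates why pairing words gives a clustering model more co-occurrence evidence per short text than the raw token counts alone.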

Notes

  1. https://news.google.com/news/

  2. https://trec.nist.gov/data/microblog.html

Acknowledgments

This work was supported by the Science and Technology Development Fund, Macau SAR (SKL-IOTSC-2018-2020, FDCT/0045/2019/A1, FDCT/007/2016/AFJ), the Guangzhou Science and Technology Innovation and Development Commission (EF005/FST-GZG/2019/GSTIC), and the Research Committee of University of Macau (MYRG2017-00212-FST, MYRG2018-00129-FST).

Author information

Correspondence to Zhiguo Gong.

About this article

Cite this article

Chen, J., Gong, Z. & Liu, W. A Dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 50, 1609–1619 (2020). https://doi.org/10.1007/s10489-019-01606-1

  • DOI: https://doi.org/10.1007/s10489-019-01606-1