Abstract
Short text stream clustering has become an important problem for mining textual data on diverse social media platforms (e.g., Twitter). However, most existing clustering methods (e.g., LDA and PLSA) assume a static corpus of long texts, and little attention has been given to short text streams. Clustering short texts is more challenging than clustering long texts because their word co-occurrence patterns are sparse. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which addresses both the topic drift problem and the sparsity problem in short text stream clustering. DP-BMM has two major advantages: (1) it explicitly exploits the word pairs constructed from each document to enhance the word co-occurrence pattern in short texts; (2) it handles the topic drift of short text streams naturally. We further propose an improved variant of DP-BMM with a forgetting property, called DP-BMM-FP, which efficiently deletes the biterms of outdated documents by deleting the clusters of outdated batches. For inference, we adopt an online Gibbs sampling method for parameter estimation. Extensive experiments on real-world datasets show that DP-BMM and DP-BMM-FP achieve better performance than state-of-the-art methods in terms of the NMI metric.
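To make the word-pair idea concrete, the following is a minimal sketch of biterm construction, assuming the common definition of a biterm as an unordered pair of words that co-occur in the same short text. The abstract does not specify the exact construction used by DP-BMM (for example, any windowing or duplicate handling), so the function names `extract_biterms` and `biterm_counts` and the toy documents are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one tokenized short text.

    Hypothetical helper: assumes every pair of tokens in the document
    forms a biterm, with no window limit.
    """
    # Sort each pair so ("b", "a") and ("a", "b") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(documents):
    """Aggregate biterm frequencies over an iterable of tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_biterms(tokens))
    return counts


# Toy data (invented for illustration only).
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "launch", "event"],
]
counts = biterm_counts(docs)
print(counts[("apple", "launch")])
```

A document with n tokens yields n(n-1)/2 biterms under this assumption, so even a three-word text contributes three co-occurrence observations, which is how pairing words can partly offset the sparsity of short texts.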
Acknowledgments
This work was supported by the Science and Technology Development Fund, Macau SAR (SKL-IOTSC-2018-2020, FDCT/0045/2019/A1, FDCT/007/2016/AFJ), the Guangzhou Science and Technology Innovation and Development Commission (EF005/FST-GZG/2019/GSTIC), and the Research Committee of the University of Macau (MYRG2017-00212-FST, MYRG2018-00129-FST).
Cite this article
Chen, J., Gong, Z. & Liu, W. A Dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 50, 1609–1619 (2020). https://doi.org/10.1007/s10489-019-01606-1
DOI: https://doi.org/10.1007/s10489-019-01606-1