
Improving biterm topic model with word embeddings

Published in: World Wide Web

Abstract

As a fundamental information extraction method, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly exploit word co-occurrence information in texts for topic inference. However, when these models are applied to short texts, it is usually hard to extract a group of words that are semantically coherent and representative, because the feature space of short texts is too sparse to provide enough co-occurrence information. The continuous development of word embeddings brings new representations of words and a more effective measurement of word semantic similarity from a conceptual perspective. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The results show that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of the frequency and semantic similarity of biterms. NBTMWE shows the following advantages over BTM: (1) it can distinguish meaningful latent topics from a noise topic that consists of common words appearing in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Pólya urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short-text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
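The first step the abstract describes — mining biterms (unordered word pairs) from each short text, then scoring each biterm by its corpus frequency and by the embedding similarity of its two words — can be illustrated with a minimal sketch. The toy 2-d vectors below stand in for CBOW embeddings trained on a large corpus; all names and data here are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter
from itertools import combinations
import math

def extract_biterms(docs):
    """Mine unordered word pairs (biterms) from each short text.

    BTM-style models typically treat a whole short text as one
    window, so every distinct word pair in a document co-occurs.
    """
    biterms = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        for w1, w2 in combinations(words, 2):
            biterms[(w1, w2)] += 1
    return biterms

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy 2-d "embeddings" stand in for pretrained CBOW vectors.
embeddings = {
    "apple": [0.9, 0.1], "fruit": [0.8, 0.2],
    "stock": [0.1, 0.9], "price": [0.2, 0.8],
}

docs = ["apple fruit price", "apple stock price", "stock price"]
biterms = extract_biterms(docs)

# Score each biterm by (frequency, embedding similarity of its words);
# high-frequency, high-similarity biterms are candidates for meaningful
# topics, while the rest lean toward the noise topic.
scores = {
    (w1, w2): (freq, cosine(embeddings[w1], embeddings[w2]))
    for (w1, w2), freq in biterms.items()
    if w1 in embeddings and w2 in embeddings
}
```

Under this toy data, ("price", "stock") is both frequent and semantically close, whereas ("apple", "stock") co-occurs once with low similarity — the kind of contrast the noise-topic prior is meant to exploit.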


Notes

  1. This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-master-dataset-sina

  2. The implementation of OpenCC is available at https://github.com/argolab/OpenCC

  3. This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-dataset-web-snippets

  4. The stop word list is downloaded from NLTK: http://www.nltk.org/

  5. The Chinese Wikipedia dataset is available at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

  6. The English Wikipedia dataset is available at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  7. The implementation of CBOW is available at https://code.google.com/p/word2vec

  8. The authors' implementation of GPUDMM is available at https://github.com/NobodyWHU/GPUDMM

  9. The authors' implementation of LFTM is available at https://github.com/datquocnguyen/LFTM

  10. The implementation of NBTMWE is available at https://github.com/Jenny-HJJ/NBTMWE

  11. The chi-score of each word in the vocabulary of the dataset is calculated by sklearn.chi2.

  12. The implementation is available at http://scikit-learn.org/


Acknowledgements

The work was supported by the National Natural Science Foundation of China (NSFC, No.61802194, No.61902190) and Natural Science Foundation in University of Jiangsu Province, China (No.17KJB520015, No.19KJB520040).


Corresponding author

Correspondence to Pengwei Li.


Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Cite this article

Huang, J., Peng, M., Li, P. et al. Improving biterm topic model with word embeddings. World Wide Web 23, 3099–3124 (2020). https://doi.org/10.1007/s11280-020-00823-w

