Abstract
As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly exploit word co-occurrence information in texts for topic inference. However, when applied to short texts, they usually struggle to extract groups of words that are both semantically coherent and representative, because the feature space of short texts is too sparse to provide enough co-occurrence information. The continuous development of word embeddings offers a new representation of words and a more effective, concept-level measurement of word semantic similarity. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The analysis shows that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of biterm frequency and semantic similarity. Compared with BTM, NBTMWE offers two advantages: (1) it can distinguish meaningful latent topics from a noise topic consisting of common words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during sampling via the generalized Pólya urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of topic coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
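The biterm-mining step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-document corpus and the 2-d embedding vectors are invented for the example, whereas the paper uses CBOW embeddings trained on Wikipedia.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def mine_biterms(docs):
    """Count unordered word pairs (biterms) within each short text,
    aggregated over the whole corpus."""
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus and 2-d embeddings, invented for illustration.
docs = [["apple", "fruit", "sweet"], ["apple", "phone", "screen"]]
emb = {"apple": [1.0, 0.2], "fruit": [0.9, 0.1], "sweet": [0.8, 0.3],
       "phone": [0.1, 1.0], "screen": [0.2, 0.9]}

# Each biterm gets a corpus frequency and a semantic similarity,
# the two signals NBTMWE uses as prior knowledge for the noise topic.
for (w1, w2), freq in sorted(mine_biterms(docs).items()):
    print(w1, w2, freq, round(cosine(emb[w1], emb[w2]), 3))
```

In this toy run, biterms such as (apple, fruit) receive a high similarity while cross-domain pairs such as (fruit, phone) would receive a low one, which is the signal the model uses to separate topical biterms from noise.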
Notes
This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-master-dataset-sina
The implementation of openCC is available at https://github.com/argolab/OpenCC
This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-dataset-web-snippets
Stop word list is downloaded from NLTK: http://www.nltk.org/
The Chinese Wikipedia dataset is available at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
The English Wikipedia dataset is available at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
The implementation of CBOW is available at https://code.google.com/p/word2vec
The authors' implementation of GPUDMM is available at https://github.com/NobodyWHU/GPUDMM
The authors' implementation of LFTM is available at https://github.com/datquocnguyen/LFTM
The implementation of NBTMWE is available at https://github.com/Jenny-HJJ/NBTMWE
The chi-scores of each word in the vocabulary of the dataset are calculated by sklearn.chi2.
The implementation is available at http://scikit-learn.org/
References
Amplayo, R.K., Lee, S., Song, M.: Incorporating product description to sentiment topic models for improved aspect-based sentiment analysis. Inf. Sci. 454-455, 200–215 (2018)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Chen, Z., Liu, B.: Topic modeling using topics from many domains, lifelong learning and big data. In: International conference on machine learning, pp. 703–711 (2014)
Chen, Z., Liu, B.: Mining topics in documents: standing on the shoulders of big data. In: ACM conference on knowledge discovery and data mining, pp. 1116–1125 (2014)
Cheng, X., Yan, X., Lan, Y., Guo, J.: BTM: Topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12), 2928–2941 (2014)
Das, R., Zaheer, M., Dyer, C.: Gaussian LDA for topic models with word embeddings. In: International joint conference on natural language processing, pp. 795–804 (2015)
Fang, A., Macdonald, C., Ounis, I., Habel, P.: Using word embedding to evaluate the coherence of topics from twitter data. In: ACM SIGIR conference on research and development in information retrieval, pp. 1057–1060 (2016)
Gao, W., Peng, M., Wang, H., Zhang, Y., Xie, Q., Tian, G.: Incorporating word embeddings into topic modeling of short text. Knowl. Inf. Syst. 61 (2), 1123–1145 (2019)
Garciapablos, A., Cuadros, M., Rigau, G.: W2VLDA: Almost unsupervised system for aspect based sentiment analysis. Exp. Syst. Appl., 127–137 (2018)
Haigh, J.: Pólya urn models. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(4), 942–942 (2009)
Hofmann, T.: Probabilistic latent semantic indexing. In: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)
Huang, J., Peng, M., Wang, H., Cao, J., Gao, W., Zhang, X.: A probabilistic method for emerging topic tracking in microblog stream. World Wide Web 20(2), 325–350 (2017)
Jiang, L., Lu, H., Xu, M., Wang, C.: Biterm pseudo document topic model for short text. In: IEEE International Conference on Tools with Artificial Intelligence, pp. 865–872 (2016)
Li, S., Chua, T.S., Zhu, J., Miao, C.: Generative topic embedding: a continuous representation of documents. In: Annual Meeting of the Association for Computational Linguistics, pp. 666–675 (2016)
Li, C., Duan, Y., Wang, H., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans. Inf. Syst. 36(2), 30 (2017)
Li, C., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Topic modeling for short texts with auxiliary word embeddings. In: International ACM SIGIR conference on research & development in information retrieval, pp. 165–174 (2016)
Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: International ACM SIGIR conference on research & development in information retrieval, pp. 889–892 (2013)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations, pp. 1–12 (2013)
Mimno, D.M., Wallach, H.M., Talley, E.M., Leenders, M., Mccallum, A.: Optimizing semantic coherence in topic models. In: Conference on empirical methods in natural language processing, pp. 262–272 (2011)
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human language technologies: Conference of the North American chapter of the association of computational linguistics, pp. 100–108 (2010)
Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguistics 3, 299–313 (2015)
Peng, M., Chen, D., Xie, Q., Zhang, Y., Wang, H., Hu, G., Gao, W., Zhang, Y.: Topic-net conversation model. In: International conference on Web information systems engineering, pp. 483–496 (2018)
Peng, M., Gao, B., Zhu, J., Huang, J., Yuan, M., Li, F.: High quality information extraction and query-oriented summarization for automatic query-reply in social network. Exp. Syst. Appl. 44(2016), 92–101 (2016)
Peng, M., Huang, J., Fu, H., Zhu, J., Zhou, L., He, Y.: High quality microblog extraction based on multiple features fusion and time-frequency transformation. In: International conference on web information systems engineering, pp. 188–201 (2013)
Peng, M., Xie, Q., Zhang, Y., Wang, H., Zhang, X., Huang, J., Tian, G.: Neural sparse topical coding. In: Annual meeting of the Association for Computational Linguistics, pp. 2332–2340 (2018)
Peng, M., Zhu, J., Wang, H., Li, X., Zhang, Y., Zhang, X., Tian, G.: Mining event-oriented topics in microblog stream with unsupervised multi-view hierarchical embedding. ACM Trans. Knowl. Discov. Data 12(3), 1–26 (2018)
Phan, X. H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & Web with hidden topics from large-scale data collections. In: International Conference on World Wide Web, pp. 91–100 (2008)
Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: International conference on artificial intelligence, pp. 2270–2276 (2015)
Shi, B., Lam, W., Jameel, S., Schockaert, S., Lai, K.P.: Jointly learning word embeddings and latent topics. In: International ACM SIGIR conference on research & development in information retrieval, pp. 375–38 (2017)
Sorokin, D., Gurevych, I.: Context-aware representations for knowledge base relation extraction. In: Conference on Empirical Methods in Natural Language Processing, pp. 1784–1789 (2017)
Sun, A.: Short text classification using very few words. In: International ACM SIGIR conference on research & development in information retrieval, pp. 1145–1146 (2012)
Wen, J., Tu, H., Cheng, X., Xie, R., Yin, W.: Joint modeling of users, questions and answers for answer selection in CQA. Exp. Syst. Appl. 118(2019), 563–572 (2019)
Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: Finding topic-sensitive influential Twitterers. In: ACM international conference on Web search and data mining, pp. 261–270 (2010)
Wu, Z., Lei, L., Li, G., Huang, H., Zheng, C., Chen, E.: A topic modeling based approach to novel document automatic summarization. Exp. Syst. Appl. 84 (2017), 12–23 (2017)
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: International conference on World Wide Web, pp. 1445–1456 (2013)
Yan, X., Guo, J., Lan, Y., Xu, J., Cheng, X.: A probabilistic model for bursty topic discovery in microblogs. In: AAAI conference on artificial intelligence, pp. 353–359 (2015)
Yang, Y., Wang, F., Zhang, J., Xu, J., Yu, P.S.: A topic model for co-occurring normal documents and short texts. World Wide Web 21(2), 487–513 (2018)
Yin, J., Wang, J.: A Dirichlet multinomial mixture model-based approach for short text clustering. In: ACM SIGKDD international conference on knowledge discovery & data mining, pp. 233–242 (2014)
Acknowledgements
The work was supported by the National Natural Science Foundation of China (NSFC, No. 61802194, No. 61902190) and the Natural Science Foundation in University of Jiangsu Province, China (No. 17KJB520015, No. 19KJB520040).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Huang, J., Peng, M., Li, P. et al. Improving biterm topic model with word embeddings. World Wide Web 23, 3099–3124 (2020). https://doi.org/10.1007/s11280-020-00823-w