Abstract
As one of the fundamental information extraction methods, the topic model has been widely used in text clustering, information recommendation, and other text analysis tasks. Conventional topic models mainly exploit word co-occurrence information in texts for topic inference. However, when applied to short texts, they usually struggle to extract groups of words that are both semantically coherent and representative, because the feature space of short texts is too sparse to provide enough co-occurrence information. The continuous development of word embeddings offers a new representation of words and a more effective, concept-level measurement of word semantic similarity. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The analysis shows that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this observation, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of biterm frequency and semantic similarity. Compared with BTM, NBTMWE offers two advantages: (1) it can distinguish meaningful latent topics from a noise topic consisting of common words that appear in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during sampling via the generalized Pólya urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short text datasets (Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms state-of-the-art models in terms of topic coherence, topic word similarity, and classification accuracy. Qualitatively, each topic generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
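The biterm-mining step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-document corpus and the 2-d embedding vectors are invented for the example, whereas the paper uses CBOW embeddings trained on Wikipedia.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def mine_biterms(docs):
    """Count unordered word pairs (biterms) within each short text,
    aggregated over the whole corpus."""
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus and 2-d embeddings, invented for illustration.
docs = [["apple", "fruit", "sweet"], ["apple", "phone", "screen"]]
emb = {"apple": [1.0, 0.2], "fruit": [0.9, 0.1], "sweet": [0.8, 0.3],
       "phone": [0.1, 1.0], "screen": [0.2, 0.9]}

# Each biterm gets a corpus frequency and a semantic similarity,
# the two signals NBTMWE uses as prior knowledge for the noise topic.
for (w1, w2), freq in sorted(mine_biterms(docs).items()):
    print(w1, w2, freq, round(cosine(emb[w1], emb[w2]), 3))
```

In this toy run, biterms such as (apple, fruit) receive a high similarity while cross-domain pairs such as (fruit, phone) would receive a low one, which is the signal the model uses to separate topical biterms from noise.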
Notes
This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-master-dataset-sina
The implementation of openCC is available at https://github.com/argolab/OpenCC
This dataset is available at https://github.com/Jenny-HJJ/NBTMWE/tree/Jenny-HJJ-dataset-web-snippets
Stop word list is downloaded from NLTK: http://www.nltk.org/
The Chinese Wikipedia dataset is available at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
The English Wikipedia dataset is available at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
The implementation of CBOW is available at https://code.google.com/p/word2vec
The authors' implementation of GPUDMM is available at https://github.com/NobodyWHU/GPUDMM
The authors' implementation of LFTM is available at https://github.com/datquocnguyen/LFTM
The implementation of NBTMWE is available at https://github.com/Jenny-HJJ/NBTMWE
The chi-scores of each word in the vocabulary of the dataset are calculated by sklearn.chi2.
The implementation is available at http://scikit-learn.org/
References
Amplayo, R.K., Lee, S., Song, M.: Incorporating product description to sentiment topic models for improved aspect-based sentiment analysis. Inf. Sci. 454-455, 200–215 (2018)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Chen, Z., Liu, B.: Topic modeling using topics from many domains, lifelong learning and big data. In: International conference on machine learning, pp. 703–711 (2014)
Chen, Z., Liu, B.: Mining topics in documents: standing on the shoulders of big data. In: ACM conference on knowledge discovery and data mining, pp. 1116–1125 (2014)
Cheng, X., Yan, X., Lan, Y., Guo, J.: BTM: Topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12), 2928–2941 (2014)
Das, R., Zaheer, M., Dyer, C.: Gaussian LDA for topic models with word embeddings. In: International joint conference on natural language processing, pp. 795–804 (2015)
Fang, A., Macdonald, C., Ounis, I., Habel, P.: Using word embedding to evaluate the coherence of topics from twitter data. In: ACM SIGIR conference on research and development in information retrieval, pp. 1057–1060 (2016)
Gao, W., Peng, M., Wang, H., Zhang, Y., Xie, Q., Tian, G.: Incorporating word embeddings into topic modeling of short text. Knowl. Inf. Syst. 61 (2), 1123–1145 (2019)
Garciapablos, A., Cuadros, M., Rigau, G.: W2VLDA: Almost unsupervised system for aspect based sentiment analysis. Exp. Syst. Appl., 127–137 (2018)
Haigh, J.: Pólya urn models. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(4), 942–942 (2009)
Hofmann, T.: Probabilistic latent semantic indexing. In: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)
Huang, J., Peng, M., Wang, H., Cao, J., Gao, W., Zhang, X.: A probabilistic method for emerging topic tracking in microblog stream. World Wide Web 20(2), 325–350 (2017)
Jiang, L., Lu, H., Xu, M., Wang, C.: Biterm pseudo document topic model for short text. In: IEEE International Conference on Tools with Artificial Intelligence, pp. 865–872 (2016)
Li, S., Chua, T.S., Zhu, J., Miao, C.: Generative topic embedding: a continuous representation of documents. In: Annual Meeting of the Association for Computational Linguistics, pp. 666–675 (2016)
Li, C., Duan, Y., Wang, H., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans. Inf. Syst. 36(2), 30 (2017)
Li, C., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Topic modeling for short texts with auxiliary word embeddings. In: International ACM SIGIR conference on research & development in information retrieval, pp. 165–174 (2016)
Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: International ACM SIGIR conference on research & development in information retrieval, pp. 889–892 (2013)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations, pp. 1–12 (2013)
Mimno, D.M., Wallach, H.M., Talley, E.M., Leenders, M., Mccallum, A.: Optimizing semantic coherence in topic models. In: Conference on empirical methods in natural language processing, pp. 262–272 (2011)
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human language technologies: Conference of the North American chapter of the association of computational linguistics, pp. 100–108 (2010)
Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguistics 3, 299–313 (2015)
Peng, M., Chen, D., Xie, Q., Zhang, Y., Wang, H., Hu, G., Gao, W., Zhang, Y.: Topic-net conversation model. In: International conference on Web information systems engineering, pp. 483–496 (2018)
Peng, M., Gao, B., Zhu, J., Huang, J., Yuan, M., Li, F.: High quality information extraction and query-oriented summarization for automatic query-reply in social network. Exp. Syst. Appl. 44(2016), 92–101 (2016)
Peng, M., Huang, J., Fu, H., Zhu, J., Zhou, L., He, Y.: High quality microblog extraction based on multiple features fusion and time-frequency transformation. In: International conference on web information systems engineering, pp. 188–201 (2013)
Peng, M., Xie, Q., Zhang, Y., Wang, H., Zhang, X., Huang, J., Tian, G.: Neural sparse topical coding. In: Annual meeting of the Association for Computational Linguistics, pp. 2332–2340 (2018)
Peng, M., Zhu, J., Wang, H., Li, X., Zhang, Y., Zhang, X., Tian, G.: Mining event-oriented topics in microblog stream with unsupervised multi-view hierarchical embedding. ACM Trans. Knowl. Discov. Data 12(3), 1–26 (2018)
Phan, X. H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & Web with hidden topics from large-scale data collections. In: International Conference on World Wide Web, pp. 91–100 (2008)
Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: International conference on artificial intelligence, pp. 2270–2276 (2015)
Shi, B., Lam, W., Jameel, S., Schockaert, S., Lai, K.P.: Jointly learning word embeddings and latent topics. In: International ACM SIGIR conference on research & development in information retrieval, pp. 375–38 (2017)
Sorokin, D., Gurevych, I.: Context-aware representations for knowledge base relation extraction. In: Conference on Empirical Methods in Natural Language Processing, pp. 1784–1789 (2017)
Sun, A.: Short text classification using very few words. In: International ACM SIGIR conference on research & development in information retrieval, pp. 1145–1146 (2012)
Wen, J., Tu, H., Cheng, X., Xie, R., Yin, W.: Joint modeling of users, questions and answers for answer selection in CQA. Exp. Syst. Appl. 118(2019), 563–572 (2019)
Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: Finding topic-sensitive influential Twitterers. In: ACM international conference on Web search and data mining, pp. 261–270 (2010)
Wu, Z., Lei, L., Li, G., Huang, H., Zheng, C., Chen, E.: A topic modeling based approach to novel document automatic summarization. Exp. Syst. Appl. 84 (2017), 12–23 (2017)
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: International conference on World Wide Web, pp. 1445–1456 (2013)
Yan, X., Guo, J., Lan, Y., Xu, J., Cheng, X.: A probabilistic model for bursty topic discovery in microblogs. In: AAAI conference on artificial intelligence, pp. 353–359 (2015)
Yang, Y., Wang, F., Zhang, J., Xu, J., Yu, P.S.: A topic model for co-occurring normal documents and short texts. World Wide Web 21(2), 487–513 (2018)
Yin, J., Wang, J.: A Dirichlet multinomial mixture model-based approach for short text clustering. In: ACM SIGKDD international conference on knowledge discovery & data mining, pp. 233–242 (2014)
Acknowledgements
The work was supported by the National Natural Science Foundation of China (NSFC, No. 61802194, No. 61902190) and the Natural Science Foundation in University of Jiangsu Province, China (No. 17KJB520015, No. 19KJB520040).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Huang, J., Peng, M., Li, P. et al. Improving biterm topic model with word embeddings. World Wide Web 23, 3099–3124 (2020). https://doi.org/10.1007/s11280-020-00823-w