Abstract
Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which relies on a bag-of-words assumption for the documents. However, bag-of-words models entirely discard the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach exploits the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on several data sets from an Australian compensation organization show that the proposed algorithm is highly competitive on a document classification task.
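The two stages described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a Gaussian mixture model as the soft-clustering method, uses randomly generated vectors as stand-ins for pretrained word embeddings (e.g., word2vec or GloVe), and derives document-topic distributions by accumulating the topic responsibilities of a document's words rather than by the paper's sampling procedure. The toy vocabulary and all variable names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in embeddings: in practice these would come from a pretrained
# embedding model; random vectors are used here only to keep the sketch
# self-contained.
vocab = ["claim", "injury", "payment", "employer", "doctor", "report"]
emb = {w: rng.normal(size=50) for w in vocab}

n_topics = 3

# Stage 1: soft-cluster a random sample of embedded (1-)grams.
sample = rng.choice(vocab, size=200)
X = np.stack([emb[w] for w in sample])
gmm = GaussianMixture(n_components=n_topics, covariance_type="diag",
                      random_state=0).fit(X)

# Topic-word distributions: each topic's soft responsibility for each
# vocabulary word, normalized over the vocabulary.
V = np.stack([emb[w] for w in vocab])
resp = gmm.predict_proba(V)                      # shape (|V|, n_topics)
topic_word = resp.T / resp.T.sum(axis=1, keepdims=True)

# Stage 2: document-topic distribution obtained by accumulating the
# topic probabilities of the words observed in a document.
doc = ["claim", "payment", "claim", "employer"]
idx = [vocab.index(w) for w in doc]
doc_topic = topic_word[:, idx].sum(axis=1)
doc_topic /= doc_topic.sum()
```

Because the clustering operates in embedding space, semantically related words naturally fall under the same topic even if they never co-occur in the same document, which is the property the bag-of-words assumption cannot capture.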