
An Embedding-Based Topic Model for Document Classification

Published: 05 May 2021

Abstract

Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which relies on a bag-of-words assumption for the documents. However, bag-of-words models completely discard the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach leverages the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on various data sets from an Australian compensation organization show the remarkable comparative effectiveness of the proposed algorithm in a document classification task.
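The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, number of topics, corpus size, and the use of a Gaussian mixture for the soft clustering are all assumptions standing in for details given in the full article, and random vectors stand in for pre-trained n-gram embeddings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Assumed setup: each n-gram sampled from the corpus is represented by a
# pre-trained embedding (e.g., the mean of its word vectors). Random
# vectors stand in for 500 such embedded n-grams here.
n_grams, dim = 500, 50
ngram_embeddings = rng.normal(size=(n_grams, dim))

# Stage 1: soft-cluster the embedded n-grams. Each mixture component
# plays the role of a topic; the posterior responsibilities give every
# n-gram a distribution over topics (the topic-word distributions).
num_topics = 5
gmm = GaussianMixture(n_components=num_topics, random_state=0)
gmm.fit(ngram_embeddings)
topic_given_ngram = gmm.predict_proba(ngram_embeddings)  # shape (500, 5)

# Stage 2: derive a document-topic distribution from the topic
# distributions of the n-grams occurring in that document (here simply
# averaged; the article samples topics instead).
doc_ngram_idx = rng.choice(n_grams, size=40)  # n-grams of one document
doc_topics = topic_given_ngram[doc_ngram_idx].mean(axis=0)
print(doc_topics)
```

Because the clustering operates in the embedding space rather than on raw word counts, semantically related n-grams naturally fall under the same topic even when they never co-occur, which is the property the bag-of-words assumption loses.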



      • Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 3
  May 2021
  240 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3457152

        Copyright © 2021 Association for Computing Machinery.


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 May 2021
        • Accepted: 1 October 2020
        • Revised: 1 September 2020
        • Received: 1 July 2020
Published in TALLIP, Volume 20, Issue 3


        Qualifiers

        • research-article
        • Refereed
