
An Embedding-Based Topic Model for Document Classification

Published: 05 May 2021

Abstract

Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which relies on a bag-of-words assumption for the documents. However, bag-of-words models completely discard the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach leverages the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on various data sets from an Australian compensation organization show the remarkable comparative effectiveness of the proposed algorithm in a document classification task.
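The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, number of topics, corpus size, and the use of a Gaussian mixture for the soft clustering are all assumptions standing in for details given in the full article, and random vectors stand in for pre-trained n-gram embeddings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Assumed setup: each n-gram sampled from the corpus is represented by a
# pre-trained embedding (e.g., the mean of its word vectors). Random
# vectors stand in for 500 such embedded n-grams here.
n_grams, dim = 500, 50
ngram_embeddings = rng.normal(size=(n_grams, dim))

# Stage 1: soft-cluster the embedded n-grams. Each mixture component
# plays the role of a topic; the posterior responsibilities give every
# n-gram a distribution over topics (the topic-word distributions).
num_topics = 5
gmm = GaussianMixture(n_components=num_topics, random_state=0)
gmm.fit(ngram_embeddings)
topic_given_ngram = gmm.predict_proba(ngram_embeddings)  # shape (500, 5)

# Stage 2: derive a document-topic distribution from the topic
# distributions of the n-grams occurring in that document (here simply
# averaged; the article samples topics instead).
doc_ngram_idx = rng.choice(n_grams, size=40)  # n-grams of one document
doc_topics = topic_given_ngram[doc_ngram_idx].mean(axis=0)
print(doc_topics)
```

Because the clustering operates in the embedding space rather than on raw word counts, semantically related n-grams naturally fall under the same topic even when they never co-occur, which is the property the bag-of-words assumption loses.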



      • Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 3
  May 2021
  240 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3457152

        Copyright © 2021 Association for Computing Machinery.


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 May 2021
        • Accepted: 1 October 2020
        • Revised: 1 September 2020
        • Received: 1 July 2020
Published in TALLIP, Volume 20, Issue 3


        Qualifiers

        • research-article
        • Refereed
