Abstract
Analyzing texts from social media poses many challenges due to their characteristic shortness, massiveness, and dynamics. Short texts do not provide enough contextual information, causing traditional statistical models to fail. Furthermore, many applications face massive and dynamic collections of short texts, which pose various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple mechanism for incorporating BoB into statistical models. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), the major advantages of BBM are: (1) it enhances the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via bag of biterms, and (2) it inherits the inference and learning algorithms of the base model, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
Notes
We use the source code of Online LDA and Online HDP from https://github.com/Blei-Lab.
References
Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., Burlington, pp 289–296
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581
Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: Proceedings of The 31st international conference on machine learning, pp 190–198
Than K, Doan T (2014) Dual online inference for latent Dirichlet allocation. In: ACML
Sahami M, Heilman TD (2006) A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th international conference on world wide web. ACM, pp 377–386
Bollegala D, Matsuo Y, Ishizuka M (2007) Measuring semantic similarity between words using web search engines. In: WWW, vol 7, pp 757–766
Yih W-T, Meek C (2007) Improving similarity measures for short segments of text. In: AAAI, vol 7, pp 1489–1494
Banerjee S, Ramanathan K, Gupta A (2007) Clustering short texts using wikipedia. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 787–788
Schönhofen P (2009) Identifying document topics using the wikipedia category network. Web Intell Agent Syst 7(2):195–207
Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text and web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web. ACM, pp 91–100
Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 889–892
Grant CE, George CP, Jenneisch C, Wilson JN (2011) Online topic modeling for real-time Twitter search. In: TREC
Ye C, Wen W, Yang P (2014) TM-HDP: an effective nonparametric topic model for Tibetan messages. J Comput Inf Syst 10:10433–10444
Hong L, Davison BD (2010) Empirical study of topic modeling in Twitter. In: Proceedings of the first workshop on social media analytics. ACM, pp 80–88
Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 363–374
Zhao H, Du L, Buntine W (2017) A word embeddings informed focused topic model. In: Asian conference on machine learning, pp 423–438
Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst (TOIS) 36(2):11
Weng J, Lim E-P, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the third ACM international conference on web search and data mining. ACM, pp 261–270
Jiang L, Lu H, Xu M, Wang C (2016) Biterm pseudo document topic model for short text. In: 2016 IEEE 28th International conference on tools with artificial intelligence (ICTAI). IEEE, pp 865–872
Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81
Yang Y, Wang F, Zhang J, Xu J, Philip SY (2018) A topic model for co-occurring normal documents and short texts. World Wide Web 21(2):487–513
Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2105–2114
Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: IJCAI, pp 2270–2276
Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941
Wang C, Paisley JW, Blei DM (2011) Online variational inference for the hierarchical Dirichlet process. In: AISTATS, vol 2, p 4
Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI (2013) Streaming variational Bayes. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 1727–1735
Duc AN, Van Linh N, Kim AN, Than K (2017) Keeping priors in streaming Bayesian learning. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 247–258
Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 262–272
Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: Proceedings of the 22nd international conference on world wide web. ACM, pp 1445–1456
Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 856–864
Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303–1347
Mai K, Mai S, Nguyen A, Van Linh N, Than K (2016) Enabling hierarchical Dirichlet processes to work better for short texts at large scale. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 431–442
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL, pp 31–40
Acknowledgements
This research is supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA18, and by the Office of Naval Research Global (ONRG) under Award Number N62909-18-1-2072, and Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD) under Award Number 17IOA031.
Additional information
This paper is an extended version of our PAKDD2016 paper [33].
Appendices
Appendix A: Supplementary experimental result
To strengthen the experimental results in Sect. 5, we conduct some experiments with an additional evaluation metric besides log predictive probability (LPP): normalized pointwise mutual information (NPMI) [36]. We evaluate the NPMI of the two models, online HDP-B and online LDA-B, and compare them with their base models using BoW and BoB. The settings and models in use are exactly the same as those in Sect. 5.1.1. We adopt the four datasets: Tweet, NYtimes, Yahoo, and TMNTitle.
Normalized pointwise mutual information (NPMI) A standard metric to measure the association between a pair of discrete outcomes x and y, defined as: \( \mathrm {NPMI}(x,y) = \frac{\mathrm {PMI}(x,y)}{-\log p(x,y)} = \frac{\log \frac{p(x,y)}{p(x)p(y)}}{-\log p(x,y)} \)
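As a sanity check on the definition, the score can be computed directly from the joint and marginal probabilities; the function name below is illustrative, not from the paper:

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information of a pair (x, y).

    PMI(x, y) = log(p(x, y) / (p(x) p(y))) is normalized by -log p(x, y),
    which bounds the score in [-1, 1]: 1 for complete co-occurrence,
    0 for independence, and values below 0 for negative association.
    """
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))
```

For example, with \(p(x)=p(y)=p(x,y)=0.1\) (x and y always co-occur) the score is 1, while \(p(x,y)=p(x)p(y)\) gives 0.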
Figures 19 and 20 show the evaluation of these models by NPMI. Figure 19 confirms that OHDP-B performs better than HDP in almost all cases. Figure 20 shows that OLDA-B achieves better results than the other three models, while OBTM ranks behind only OLDA-B on NYtimes and Yahoo. Moreover, in most cases in both figures, the results with BoW show that this representation is not suitable for short text datasets.
Appendix B: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)
In BoB, after we finish training the model, we obtain topics that are multinomial distributions over biterms. We would like to convert these topics-over-biterms to topics-over-words (i.e., distributions over words). Assume that \(\varvec{\phi }_k\) is the distribution over biterms of topic k. The conversion proceeds as follows: \(p(w_i \mid z=k) = \sum _{j=1}^{V}p(w_i,w_j \mid z=k) = \sum _{j=1}^{V}p(b_{ij} \mid z=k) = \sum _{j=1}^{V}\phi _{kb_{ij}}\), where V is the vocabulary size in BoW and \(b_{ij}\) is the biterm created from the pair (\(w_i,w_j\)).
As discussed in Sect. 3.1.1, in the implementation of BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single \(b_{ij}\) with \(i<j\). Since \(b_{ij}\) and \(b_{ji}\) occur identically in every document, after training the value of \(p(b_{ij}\mid z=k)\) is expected to equal \(p(b_{ji}\mid z=k)\). Therefore, when these biterms are grouped into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \phi _{kb_{ii}} + \frac{1}{2}\sum _{b \ne b_{ii},\, b \text { contains } w_i}\phi _{kb} \)
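The merged-biterm conversion above can be sketched as follows; the `biterm_id` map (pairs \((i, j)\) with \(i \le j\) to biterm indices) and the function name are illustrative assumptions about the implementation, not the paper's code:

```python
import numpy as np

def topic_over_words(phi_k, biterm_id, V):
    """Convert one topic's distribution over merged biterms to a
    distribution over words.

    phi_k:     1-D array, phi_k[b] = p(biterm b | topic k)
    biterm_id: dict mapping word pairs (i, j), i <= j, to biterm indices
    V:         vocabulary size
    """
    p_w = np.zeros(V)
    for (i, j), b in biterm_id.items():
        if i == j:
            p_w[i] += phi_k[b]        # self-pair b_ii contributes fully
        else:
            p_w[i] += 0.5 * phi_k[b]  # a merged biterm b_ij splits its
            p_w[j] += 0.5 * phi_k[b]  # mass between w_i and w_j
    return p_w
```

Note that if \(\varvec{\phi }_k\) sums to one over biterms, the resulting word distribution also sums to one, as each biterm's mass is fully redistributed.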
Appendix C: Parameters inference for LDA-B
C.1 Lower bound function
The log likelihood is bounded below by the lower bound induced from Jensen's inequality:
This lower bound can be written as follows:
where \(\tilde{w}^{(1)}_{dm}\) and \(\tilde{w}^{(2)}_{dm} \) denote the first and second words of biterm \(\tilde{w}_{dm} \), and \(\tilde{z}_{dm} \) is the topic assignment of \(\tilde{w}_{dm} \). The expansion of this equation is similar to [2]. The only difference lies in the biterm part \((\tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm})\):
Then, we maximize the lower bound function with respect to each variational parameter in turn.
C.2 Variational parameter \(\lambda \)
Choose a topic index k. Fix \(\gamma , \phi \) and each \(\lambda _j \) for \( j \ne k\). The updated value of \(\lambda _k\) is computed by setting the derivative of \(L(\lambda _k)\) to zero:
C.3 Variational parameter \(\gamma \)
Choose a document d and fix \(\lambda , \phi \) and each \(\gamma _c \) for \( c \ne d\). Similarly to \(\lambda \), we obtain:
C.4 Variational parameter \(\phi \)
Fix \(\lambda , \gamma , \tilde{\phi } \) and \( \phi _{cu} \) for \((c,u) \ne (d,n) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \phi _{dnk} = 1 \), we have:
C.5 Variational parameter \(\tilde{\phi }\)
Fix \(\lambda , \gamma , \phi \) and \( \tilde{\phi }_{cu} \) for \((c,u) \ne (d,m) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \tilde{\phi }_{dmk} = 1 \), we obtain:
Here, we denote \(\tilde{w}_{dm}^{(1)}\) and \(\tilde{w}_{dm}^{(2)}\) as the first word and second word of biterm \(\tilde{w}_{dm}\), respectively.
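The local (per-document) coordinate-ascent updates above can be sketched in the standard mean-field form for LDA extended with biterm tokens, where each biterm draws one topic that generates both of its words. This is an illustrative sketch under that assumption, not the paper's exact implementation; the function name, the numerical digamma, and the iteration count are all assumptions:

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    # numerical digamma via a centered difference of log-Gamma
    # (sufficient for a sketch; a library digamma would be used in practice)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def update_local(doc_words, doc_biterms, Elog_beta, alpha, iters=20):
    """Coordinate-ascent updates of gamma, phi, phi_tilde for one document.

    doc_words:   list of word ids w_dn
    doc_biterms: list of pairs (w1, w2) for biterms w~_dm
    Elog_beta:   K x V matrix of E[log beta_kv] under q(beta | lambda)
    """
    K = Elog_beta.shape[0]
    gamma = np.full(K, alpha + (len(doc_words) + len(doc_biterms)) / K)
    w1 = [b[0] for b in doc_biterms]
    w2 = [b[1] for b in doc_biterms]
    for _ in range(iters):
        Elog_theta = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
        # phi_dnk  proportional to exp(E[log theta_dk] + E[log beta_k,w_dn])
        phi = np.exp(Elog_theta[:, None] + Elog_beta[:, doc_words])
        phi /= phi.sum(axis=0, keepdims=True)
        # phi~_dmk proportional to exp(E[log theta_dk]
        #                              + E[log beta_k,w1] + E[log beta_k,w2]),
        # since one topic generates both words of the biterm
        phi_t = np.exp(Elog_theta[:, None] + Elog_beta[:, w1] + Elog_beta[:, w2])
        phi_t /= phi_t.sum(axis=0, keepdims=True)
        # gamma_dk = alpha + sum_n phi_dnk + sum_m phi~_dmk
        gamma = alpha + phi.sum(axis=1) + phi_t.sum(axis=1)
    return gamma, phi, phi_t
```

The global update of \(\lambda \) would then accumulate, for each topic k and word v, the responsibilities \(\phi _{dnk}\) of word tokens plus the responsibilities \(\tilde{\phi }_{dmk}\) of every biterm containing v, mirroring the per-token structure above.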
Cite this article
Tuan, A.P., Tran, B., Nguyen, T.H. et al. Bag of biterms modeling for short texts. Knowl Inf Syst 62, 4055–4090 (2020). https://doi.org/10.1007/s10115-020-01482-z