
Bag of biterms modeling for short texts

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Analyzing texts from social media encounters many challenges due to their shortness, massiveness, and dynamics. Short texts do not provide enough context information, causing traditional statistical models to fail. Furthermore, many applications often face massive and dynamic collections of short texts, which pose various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models incorporate BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), the major advantages of BBM are: (1) it enhances the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via bag of biterms, and (2) it inherits the inference and learning algorithms of the primitive models, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
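For illustration only, the following minimal Python sketch shows how the biterm part of such a representation can be built from a single short document by pairing its co-occurring words (the original terms are kept alongside, as described above); it is a simplified reading of the idea, not the authors' preprocessing code.

```python
from collections import Counter
from itertools import combinations

def bag_of_biterms(tokens):
    """Build the bag-of-biterms counts for one (short) document.

    Every unordered pair of co-occurring words becomes a biterm; counting
    the pairs gives the BoB analogue of bag-of-words counts.
    """
    # Merge (w_i, w_j) and (w_j, w_i) into a single biterm by sorting the pair.
    biterms = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(biterms)

# Example: a 4-word tweet yields 6 biterms, enriching its sparse context.
print(bag_of_biterms(["obama", "health", "care", "law"]))
```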



Notes

  1. http://acube.di.unipi.it/TMNdataset.

  2. https://answers.yahoo.com/.

  3. http://twitter.com/.

  4. http://www.nytimes.com/.

  5. We use the source code of Online LDA and Online HDP from https://github.com/Blei-Lab.

  6. http://qwone.com/~jason/20Newsgroups/.

  7. http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.

  8. http://davis.wpi.edu/xmdv/datasets/ohsumed.html.

  9. https://www.medline.com.

  10. https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  11. https://nlp.stanford.edu/projects/glove/.

References

  1. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., Burlington, pp 289–296

  2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  3. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581

  4. Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: Proceedings of The 31st international conference on machine learning, pp 190–198

  5. Than K, Doan T (2014) Dual online inference for latent Dirichlet allocation. In: ACML

  6. Sahami M, Heilman TD (2006) A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th international conference on world wide web. ACM, pp 377–386

  7. Bollegala D, Matsuo Y, Ishizuka M (2007) Measuring semantic similarity between words using web search engines. In: WWW, vol 7, pp 757–766

  8. Yih W-T, Meek C (2007) Improving similarity measures for short segments of text. In: AAAI, vol 7, pp 1489–1494

  9. Banerjee S, Ramanathan K, Gupta A (2007) Clustering short texts using wikipedia. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 787–788

  10. Schönhofen P (2009) Identifying document topics using the wikipedia category network. Web Intell Agent Syst 7(2):195–207

  11. Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text and web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web. ACM, pp 91–100

  12. Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 889–892

  13. Grant CE, George CP, Jenneisch C, Wilson JN (2011) Online topic modeling for real-time Twitter search. In: TREC

  14. Ye C, Wen W, Pu Y (2014) TM-HDP: an effective nonparametric topic model for Tibetan messages. J Comput Inf Syst 10:10433–10444

  15. Hong L, Davison BD (2010) Empirical study of topic modeling in Twitter. In: Proceedings of the first workshop on social media analytics. ACM, pp 80–88

  16. Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 363–374

  17. Zhao H, Du L, Buntine W (2017) A word embeddings informed focused topic model. In: Asian conference on machine learning, pp 423–438

  18. Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst (TOIS) 36(2):11

  19. Weng J, Lim E-P, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the third ACM international conference on web search and data mining. ACM, pp 261–270

  20. Jiang L, Lu H, Xu M, Wang C (2016) Biterm pseudo document topic model for short text. In: 2016 IEEE 28th International conference on tools with artificial intelligence (ICTAI). IEEE, pp 865–872

  21. Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81

  22. Yang Y, Wang F, Zhang J, Xu J, Philip SY (2018) A topic model for co-occurring normal documents and short texts. World Wide Web 21(2):487–513

  23. Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2105–2114

  24. Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: IJCAI, pp 2270–2276

  25. Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941

  26. Wang C, Paisley JW, Blei DM (2011) Online variational inference for the hierarchical Dirichlet process. In: AISTATS, vol 2, p 4

  27. Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI (2013) Streaming variational Bayes. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 1727–1735

  28. Duc AN, Van Linh N, Kim AN, Than K (2017) Keeping priors in streaming Bayesian learning. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 247–258

  29. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 262–272

  30. Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: Proceedings of the 22nd international conference on world wide web. ACM, pp 1445–1456

  31. Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 856–864

  32. Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303–1347

  33. Mai K, Mai S, Nguyen A, Van Linh N, Than K (2016) Enabling hierarchical Dirichlet processes to work better for short texts at large scale. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 431–442

  34. Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  35. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  36. Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL, pp 31–40


Acknowledgements

This research is supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA18, by the Office of Naval Research Global (ONRG) under Award Number N62909-18-1-2072, and by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD) under Award Number 17IOA031.

Author information


Corresponding author

Correspondence to Linh Ngo Van.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an extended version of our PAKDD2016 paper [33].

Appendices

Appendix A: Supplementary experimental results

To strengthen the experimental results in Sect. 5, we conduct additional experiments with another evaluation metric besides log predictive probability (LPP), namely normalized pointwise mutual information (NPMI) [36]. We evaluate the NPMI of the two models, online HDP-B and online LDA-B, and compare them with their base models using BoW and BoB. The settings and models in use are exactly the same as those in Sect. 5.1.1. We use four datasets: Tweet, NYtimes, Yahoo, and TMNTitle.

Fig. 19 The NPMI of online HDP-B and online HDP (with BoW and BoB)

Fig. 20 The NPMI of online LDA-B, online BTM and online LDA (with BoB and BoW)

Normalized pointwise mutual information (NPMI) is a standard metric for measuring the association between a pair of discrete outcomes. For topic coherence, the NPMI score of a top word \(w_i\) against the other top words of its topic is defined as:

$$\begin{aligned} \text {NPMI}(w_i) = \sum _{j=1}^{N-1} \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)} \end{aligned}$$
(9)
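As an illustration of how Eq. (9) can be turned into a topic-coherence score, the following Python sketch averages NPMI over pairs of a topic's top words, estimating the probabilities from document co-occurrence counts in a reference corpus; the function name, the smoothing constant, and this estimation choice are assumptions for illustration rather than the exact evaluation script used in Sect. 5.

```python
import numpy as np

def npmi_coherence(top_words, doc_word_sets, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    top_words:     word ids of one topic (e.g., its top-10 words).
    doc_word_sets: list of sets, each holding the word ids of one
                   reference document; used to estimate P(w) and P(w_i, w_j).
    """
    n_docs = len(doc_word_sets)
    scores = []
    for i, wi in enumerate(top_words):
        for wj in top_words[i + 1:]:
            p_i = sum(1 for d in doc_word_sets if wi in d) / n_docs
            p_j = sum(1 for d in doc_word_sets if wj in d) / n_docs
            p_ij = sum(1 for d in doc_word_sets if wi in d and wj in d) / n_docs
            if p_ij <= 0.0:
                scores.append(-1.0)   # never co-occur: NPMI is -1 by convention
                continue
            pmi = np.log(p_ij / (p_i * p_j + eps))
            scores.append(pmi / (-np.log(p_ij) + eps))
    return float(np.mean(scores))
```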

Figures 19 and 20 show the evaluation of these models by NPMI. Figure 19 confirms that OHDP-B performs better than online HDP in almost all cases. Figure 20 shows that OLDA-B achieves better results than the other three models, while OBTM ranks behind only OLDA-B on NYtimes and Yahoo. Moreover, in most cases in both figures, the results obtained with BoW show that this representation is not suitable for short text datasets.

Appendix B: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)

In BoB, after we finish training the model, we obtain topics that are multinomial distributions over biterms. We would like to convert these topics-over-biterms into topics-over-words (i.e., distributions over words). Assume that \(\varvec{\phi }_k\) is the distribution over biterms of topic k. The conversion is performed as follows: \(p(w_i \mid z=k) = \sum _{j=1}^{V}p(w_i,w_j \mid z=k) = \sum _{j=1}^{V}p(b_{ij} \mid z=k) = \sum _{j=1}^{V}\phi _{kb_{ij}}\), where V is the vocabulary size in BoW and \(b_{ij}\) is the biterm created from the pair \((w_i,w_j)\).

As discussed in Sect. 3.1.1, in the implementation of BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single biterm \(b_{ij}\) with \(i<j\). Because the two occur identically in every document, after training the value of \(p(b_{ij}\mid z=k)\) is expected to equal \(p(b_{ji}\mid z=k)\). Therefore, when these biterms are grouped into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \sum _{j=1}^{V}p(b_{ij} \mid z=k) = \phi _{kb_{ii}} + \frac{1}{2}\sum _{b\text {: biterms containing } w_i}\phi _{kb} \).
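A small Python sketch of this conversion for the merged-biterm implementation is given below; the dictionary-based biterm indexing and the final renormalization are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def topic_over_words(phi_k, biterm_index, vocab_size):
    """Convert one topic's distribution over merged biterms into a
    distribution over words.

    phi_k:        1-D array with phi_k[b] = p(biterm b | topic k).
    biterm_index: dict mapping a word-id pair (i, j), i <= j, to biterm id b.
    vocab_size:   number of words V in the BoW vocabulary.
    """
    p_w = np.zeros(vocab_size)
    for (i, j), b in biterm_index.items():
        if i == j:
            p_w[i] += phi_k[b]          # self-biterm b_ii contributes fully to w_i
        else:
            p_w[i] += 0.5 * phi_k[b]    # a merged biterm shares its mass
            p_w[j] += 0.5 * phi_k[b]    # equally between its two words
    return p_w / p_w.sum()              # guard renormalization for p(w | z = k)
```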

Appendix C: Parameter inference for LDA-B

C.1 Lower bound function

The log-likelihood is bounded below by the lower bound obtained from Jensen's inequality:

$$\begin{aligned} \log P (w \mid \eta , \alpha ) \ge L&= E_{q} \left[ \log P(\beta , \theta , z, \tilde{z}, w \mid \eta , \alpha ) \right] \\&\quad - E_{q} \left[ \log q (\beta , \theta , z, \tilde{z} \mid \gamma , \lambda , \phi ) \right] \end{aligned}$$

This lower bound can be written as follows:

$$\begin{aligned} L&= \sum _{k=1}^{K} E_{q} \left[ \log P (\beta _k \mid \eta _k) \right] + \sum _{d=1}^{D} E_{q} \left[ \log P (\theta _d \mid \alpha ) \right] + \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log P (z_{dn} \mid \theta _d) \right] \\&\quad + \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log P (\tilde{z}_{dm} \mid \theta _d) \right] + \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log P (w_{dn} \mid z_{dn}) \right] \\&\quad + \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right] - \sum _{k=1}^{K} E_{q} \left[ \log q (\beta _k \mid \lambda _k) \right] \\&\quad - \sum _{d=1}^{D} E_{q} \left[ \log q (\theta _d \mid \gamma _d) \right] \\&\quad - \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log q (z_{dn} \mid \phi _{dn}) \right] - \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log q \left( \tilde{z}_{dm} \mid \tilde{\phi }_{dm}\right) \right] \end{aligned}$$

where \(\tilde{w}^{(1)}_{dm}\) and \(\tilde{w}^{(2)}_{dm}\) denote the first and second words of biterm \(\tilde{w}_{dm}\), and \(\tilde{z}_{dm}\) is the topic assignment of \(\tilde{w}_{dm}\). The expansion of this equation is similar to [2]. The only difference lies in the term for the biterm \((\tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm})\):

$$\begin{aligned} E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right]&= E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm} \mid \tilde{z}_{dm}\right) \right] + E_{q} \left[ \log P \left( \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right] \\&= \sum _{v=1}^V \sum _{k=1}^K I\{ \tilde{w}^{(1)}_{dm} = v \} \tilde{\phi }_{dmk} \left( \psi (\lambda _{kv}) - \psi \left( \sum _{u=1}^{V} \lambda _{ku} \right) \right) \\&\quad + \sum _{v=1}^V \sum _{k=1}^K I\{ \tilde{w}^{(2)}_{dm} = v \} \tilde{\phi }_{dmk} \left( \psi (\lambda _{kv}) - \psi \left( \sum _{u=1}^{V} \lambda _{ku} \right) \right) . \end{aligned}$$

We then maximize the lower bound with respect to each variational parameter in turn.

C.2 Variational parameter \(\lambda \)

Choose a topic index k. Fix \(\gamma , \phi \) and each \(\lambda _j \) for \( j \ne k\). The updated value of \(\lambda _k\) is computed by setting the derivative of \(L(\lambda _k)\) to zero:

$$\begin{aligned} \lambda _{k, v} \leftarrow \eta _{k, v} + \sum _{d=1}^D \sum _{n=1}^{N_d} I\left\{ w_{dn} = v \right\} \phi _{dnk} + \sum _{d=1}^D \sum _{m=1}^{M_d} \left[ I\left\{ \tilde{w}^{(1)}_{dm} = v \right\} + I\left\{ \tilde{w}^{(2)}_{dm} = v \right\} \right] \tilde{\phi }_{dmk}. \end{aligned}$$
(10)
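For concreteness, a minimal numpy sketch of the batch update in Eq. (10) follows; the data structures holding the word ids, the biterm word pairs, and the responsibilities \(\phi \) and \(\tilde{\phi }\) are illustrative assumptions.

```python
import numpy as np

def update_lambda(eta, docs_words, docs_biterms, phi, phi_tilde):
    """Batch update of the topic-word variational parameter lambda (Eq. 10).

    eta:          (K, V) prior on the topics.
    docs_words:   docs_words[d]   = list of word ids w_dn in document d.
    docs_biterms: docs_biterms[d] = list of (w1, w2) word-id pairs (biterms).
    phi:          phi[d]       = (N_d, K) array of word responsibilities.
    phi_tilde:    phi_tilde[d] = (M_d, K) array of biterm responsibilities.
    """
    lam = eta.copy()
    for d, words in enumerate(docs_words):
        for n, v in enumerate(words):
            lam[:, v] += phi[d][n]              # word term of Eq. (10)
        for m, (v1, v2) in enumerate(docs_biterms[d]):
            lam[:, v1] += phi_tilde[d][m]       # first word of the biterm
            lam[:, v2] += phi_tilde[d][m]       # second word of the biterm
    return lam
```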

C.3 Variational parameter \(\gamma \)

Choose a document d, and fix \(\lambda , \phi \) and each \(\gamma _c \) for \( c \ne d\). Proceeding as for \(\lambda \), we obtain:

$$\begin{aligned} \gamma _{d, k} = \alpha _{k} + \sum _{n=1}^{N_d} \phi _{dnk} + \sum _{m=1}^{M_d} \tilde{\phi }_{dmk}. \end{aligned}$$
(11)

C.4 Variational parameter \(\phi \)

Fix \(\lambda , \gamma , \tilde{\phi } \) and \( \phi _{cu} \) for \((c,u) \ne (d,n) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \phi _{dnk} = 1 \), we have:

$$\begin{aligned} \phi _{dnk} \propto \exp \left\{ E_{q} [\log \theta _{d, k}] + E_{q}\left[ \log \beta _{k,w_{dn}}\right] \right\} . \end{aligned}$$
(12)

C.5 Variational parameter \(\tilde{\phi }\)

Fix \(\lambda , \gamma , \phi \) and \( \tilde{\phi }_{cu} \) for \((c,u) \ne (d,m) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \tilde{\phi }_{dmk} = 1 \), we obtain:

$$\begin{aligned} \tilde{\phi }_{d, m, k} \propto \exp \left\{ E_{q}[\log \theta _{d, k}] + E_{q}\left[ \log \beta _{k,\tilde{w}_{dm}^{(1)}}\right] + E_{q}\left[ \log \beta _{k,\tilde{w}_{dm}^{(2)}}\right] \right\} . \end{aligned}$$
(13)

Here, we denote \(\tilde{w}_{dm}^{(1)}\) and \(\tilde{w}_{dm}^{(2)}\) as the first word and second word of biterm \(\tilde{w}_{dm}\), respectively.
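Putting Eqs. (11)-(13) together, the following minimal sketch runs the per-document coordinate-ascent loop; the digamma-based expectations follow standard variational LDA, while the initialization of \(\gamma \) and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def local_step(words, biterms, lam, alpha, n_iters=50):
    """Coordinate ascent over gamma_d, phi_d and phi_tilde_d (Eqs. 11-13).

    words:   list of word ids w_dn in the document.
    biterms: list of (w1, w2) word-id pairs built from the document.
    lam:     (K, V) variational parameter of the topics.
    alpha:   (K,) Dirichlet prior on the topic proportions.
    """
    K = lam.shape[0]
    # E_q[log beta_{k,v}] = digamma(lambda_kv) - digamma(sum_u lambda_ku)
    e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.full(K, alpha.mean() + (len(words) + len(biterms)) / K)
    w1 = [b[0] for b in biterms]
    w2 = [b[1] for b in biterms]
    for _ in range(n_iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # Eq. (12): responsibilities of single words
        phi = np.exp(e_log_theta + e_log_beta[:, words].T)
        phi /= phi.sum(axis=1, keepdims=True)
        # Eq. (13): responsibilities of biterms (both words contribute)
        phi_t = np.exp(e_log_theta + e_log_beta[:, w1].T + e_log_beta[:, w2].T)
        phi_t /= phi_t.sum(axis=1, keepdims=True)
        # Eq. (11): update the topic-proportion parameter
        gamma = alpha + phi.sum(axis=0) + phi_t.sum(axis=0)
    return gamma, phi, phi_t
```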


About this article


Cite this article

Tuan, A.P., Tran, B., Nguyen, T.H. et al. Bag of biterms modeling for short texts. Knowl Inf Syst 62, 4055–4090 (2020). https://doi.org/10.1007/s10115-020-01482-z

