
Bag of biterms modeling for short texts

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Analyzing texts from social media encounters many challenges due to their shortness, massiveness, and dynamics. Short texts do not provide enough context information, causing traditional statistical models to fail. Furthermore, many applications often face massive and dynamic collections of short texts, which pose various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models incorporate BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), the major advantages of BBM are: (1) it enhances the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via bag of biterms, and (2) it inherits the inference and learning algorithms of the primitive models, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
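For illustration only, the following minimal Python sketch shows how the biterm part of such a representation can be built from a single short document by pairing its co-occurring words (the original terms are kept alongside, as described above); it is a simplified reading of the idea, not the authors' preprocessing code.

```python
from collections import Counter
from itertools import combinations

def bag_of_biterms(tokens):
    """Build the bag-of-biterms counts for one (short) document.

    Every unordered pair of co-occurring words becomes a biterm; counting
    the pairs gives the BoB analogue of bag-of-words counts.
    """
    # Merge (w_i, w_j) and (w_j, w_i) into a single biterm by sorting the pair.
    biterms = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(biterms)

# Example: a 4-word tweet yields 6 biterms, enriching its sparse context.
print(bag_of_biterms(["obama", "health", "care", "law"]))
```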



Notes

  1. http://acube.di.unipi.it/TMNdataset.

  2. https://answers.yahoo.com/.

  3. http://twitter.com/.

  4. http://www.nytimes.com/.

  5. We use the source code of Online LDA and Online HDP from https://github.com/Blei-Lab.

  6. http://qwone.com/~jason/20Newsgroups/.

  7. http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.

  8. http://davis.wpi.edu/xmdv/datasets/ohsumed.html.

  9. https://www.medline.com.

  10. https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  11. https://nlp.stanford.edu/projects/glove/.

References

  1. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., Burlington, pp 289–296

  2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  3. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581

  4. Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: Proceedings of The 31st international conference on machine learning, pp 190–198

  5. Than K, Doan T (2014) Dual online inference for latent Dirichlet allocation. In: ACML

  6. Sahami M, Heilman TD (2006) A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th international conference on world wide web. ACM, pp 377–386

  7. Bollegala D, Matsuo Y, Ishizuka M (2007) Measuring semantic similarity between words using web search engines. In: WWW, vol 7, pp 757–766

  8. Yih W-T, Meek C (2007) Improving similarity measures for short segments of text. In: AAAI, vol 7, pp 1489–1494

  9. Banerjee S, Ramanathan K, Gupta A (2007) Clustering short texts using wikipedia. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 787–788

  10. Schönhofen P (2009) Identifying document topics using the wikipedia category network. Web Intell Agent Syst 7(2):195–207

  11. Phan X-H, Nguyen L-M, Horiguchi S (2008) Learning to classify short and sparse text and web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web. ACM, pp 91–100

  12. Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 889–892

  13. Grant CE, George CP, Jenneisch C, Wilson JN (2011) Online topic modeling for real-time Twitter search. In: TREC

  14. Ye C, Wen W, Pu Y (2014) TM-HDP: an effective nonparametric topic model for Tibetan messages. J Comput Inf Syst 10:10433–10444

  15. Hong L, Davison BD (2010) Empirical study of topic modeling in Twitter. In: Proceedings of the first workshop on social media analytics. ACM, pp 80–88

  16. Qiang J, Chen P, Wang T, Wu X (2017) Topic modeling over short texts by incorporating word embeddings. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 363–374

  17. Zhao H, Du L, Buntine W (2017) A word embeddings informed focused topic model. In: Asian conference on machine learning, pp 423–438

  18. Li C, Duan Y, Wang H, Zhang Z, Sun A, Ma Z (2017) Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans Inf Syst (TOIS) 36(2):11

  19. Weng J, Lim E-P, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the third ACM international conference on web search and data mining. ACM, pp 261–270

  20. Jiang L, Lu H, Xu M, Wang C (2016) Biterm pseudo document topic model for short text. In: 2016 IEEE 28th International conference on tools with artificial intelligence (ICTAI). IEEE, pp 865–872

  21. Bicalho P, Pita M, Pedrosa G, Lacerda A, Pappa GL (2017) A general framework to expand short text for topic modeling. Inf Sci 393:66–81

  22. Yang Y, Wang F, Zhang J, Xu J, Philip SY (2018) A topic model for co-occurring normal documents and short texts. World Wide Web 21(2):487–513

  23. Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2105–2114

  24. Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: IJCAI, pp 2270–2276

  25. Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941

  26. Wang C, Paisley JW, Blei DM (2011) Online variational inference for the hierarchical Dirichlet process. In: AISTATS, vol 2, p 4

  27. Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI (2013) Streaming variational Bayes. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 1727–1735

  28. Duc AN, Van Linh N, Kim AN, Than K (2017) Keeping priors in streaming Bayesian learning. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 247–258

  29. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 262–272

  30. Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: Proceedings of the 22nd international conference on world wide web. ACM, pp 1445–1456

  31. Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. In: Proceedings of advances in neural information processing systems conferences. Curran Associates, pp 856–864

  32. Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303–1347

  33. Mai K, Mai S, Nguyen A, Van Linh N, Than K (2016) Enabling hierarchical Dirichlet processes to work better for short texts at large scale. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 431–442

  34. Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  35. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  36. Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL, pp 31–40


Acknowledgements

This research is supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA18, by the Office of Naval Research Global (ONRG) under Award Number N62909-18-1-2072, and by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD) under Award Number 17IOA031.

Author information


Corresponding author

Correspondence to Linh Ngo Van.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an extended version of our PAKDD2016 paper [33].

Appendices

Appendix A: Supplementary experimental results

To strengthen the experimental results in Sect. 5, we conduct additional experiments with another evaluation metric besides log predictive probability (LPP), namely normalized pointwise mutual information (NPMI) [36]. We evaluate the NPMI of the two models, online HDP-B and online LDA-B, and compare them with their base models using BoW and BoB. The settings and models in use are exactly the same as those in Sect. 5.1.1. We use four datasets: Tweet, NYtimes, Yahoo, and TMNTitle.

Fig. 19 The NPMI of online HDP-B and online HDP (with BoW and BoB)

Fig. 20 The NPMI of online LDA-B, online BTM and online LDA (with BoB and BoW)

Normalized pointwise mutual information (NPMI) is a standard metric for measuring the association between a pair of discrete outcomes. For topic coherence, the NPMI score of a top word \(w_i\) against the other top words of its topic is defined as:

$$\begin{aligned} \text {NPMI}(w_i) = \sum _{j=1}^{N-1} \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)} \end{aligned}$$
(9)
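As an illustration of how Eq. (9) can be turned into a topic-coherence score, the following Python sketch averages NPMI over pairs of a topic's top words, estimating the probabilities from document co-occurrence counts in a reference corpus; the function name, the smoothing constant, and this estimation choice are assumptions for illustration rather than the exact evaluation script used in Sect. 5.

```python
import numpy as np

def npmi_coherence(top_words, doc_word_sets, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    top_words:     word ids of one topic (e.g., its top-10 words).
    doc_word_sets: list of sets, each holding the word ids of one
                   reference document; used to estimate P(w) and P(w_i, w_j).
    """
    n_docs = len(doc_word_sets)
    scores = []
    for i, wi in enumerate(top_words):
        for wj in top_words[i + 1:]:
            p_i = sum(1 for d in doc_word_sets if wi in d) / n_docs
            p_j = sum(1 for d in doc_word_sets if wj in d) / n_docs
            p_ij = sum(1 for d in doc_word_sets if wi in d and wj in d) / n_docs
            if p_ij <= 0.0:
                scores.append(-1.0)   # never co-occur: NPMI is -1 by convention
                continue
            pmi = np.log(p_ij / (p_i * p_j + eps))
            scores.append(pmi / (-np.log(p_ij) + eps))
    return float(np.mean(scores))
```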

Figures 19 and 20 show the evaluation of these models by NPMI. Figure 19 confirms that OHDP-B performs better than online HDP in almost all cases. Figure 20 shows that OLDA-B achieves better results than the other three models, while OBTM ranks behind only OLDA-B on NYtimes and Yahoo. Moreover, in most cases in both figures, the results obtained with BoW show that this representation is not suitable for short text datasets.

Appendix B: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)

In BoB, after we finish training the model, we obtain topics that are multinomial distributions over biterms. We would like to convert these topics-over-biterms into topics-over-words (i.e., distributions over words). Assume that \(\varvec{\phi }_k\) is the distribution over biterms of topic k. The conversion is performed as follows: \(p(w_i \mid z=k) = \sum _{j=1}^{V}p(w_i,w_j \mid z=k) = \sum _{j=1}^{V}p(b_{ij} \mid z=k) = \sum _{j=1}^{V}\phi _{kb_{ij}}\), where V is the vocabulary size in BoW and \(b_{ij}\) is the biterm created from the pair \((w_i,w_j)\).

As discussed in Sect. 3.1.1, in the implementation of BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single biterm \(b_{ij}\) with \(i<j\). Because the two occur identically in every document, after training the value of \(p(b_{ij}\mid z=k)\) is expected to equal \(p(b_{ji}\mid z=k)\). Therefore, when these biterms are grouped into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \sum _{j=1}^{V}p(b_{ij} \mid z=k) = \phi _{kb_{ii}} + \frac{1}{2}\sum _{b\text {: biterms containing } w_i}\phi _{kb} \).
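A small Python sketch of this conversion for the merged-biterm implementation is given below; the dictionary-based biterm indexing and the final renormalization are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def topic_over_words(phi_k, biterm_index, vocab_size):
    """Convert one topic's distribution over merged biterms into a
    distribution over words.

    phi_k:        1-D array with phi_k[b] = p(biterm b | topic k).
    biterm_index: dict mapping a word-id pair (i, j), i <= j, to biterm id b.
    vocab_size:   number of words V in the BoW vocabulary.
    """
    p_w = np.zeros(vocab_size)
    for (i, j), b in biterm_index.items():
        if i == j:
            p_w[i] += phi_k[b]          # self-biterm b_ii contributes fully to w_i
        else:
            p_w[i] += 0.5 * phi_k[b]    # a merged biterm shares its mass
            p_w[j] += 0.5 * phi_k[b]    # equally between its two words
    return p_w / p_w.sum()              # guard renormalization for p(w | z = k)
```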

Appendix C: Parameter inference for LDA-B

C.1 Lower bound function

The log-likelihood is bounded below by the lower bound obtained from Jensen's inequality:

$$\begin{aligned} \log P (w \mid \eta , \alpha ) \ge L&= E_{q} \left[ \log P(\beta , \theta , z, \tilde{z}, w \mid \eta , \alpha ) \right] \\&\quad - E_{q} \left[ \log q (\beta , \theta , z, \tilde{z} \mid \gamma , \lambda , \phi ) \right] \end{aligned}$$

This lower bound can be written as follows:

$$\begin{aligned} L&= \sum _{k=1}^{K} E_{q} \left[ \log P (\beta _k \mid \eta _k) \right] + \sum _{d=1}^{D} E_{q} \left[ \log P (\theta _d \mid \alpha ) \right] + \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log P (z_{dn} \mid \theta _d) \right] \\&\quad + \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log P (\tilde{z}_{dm} \mid \theta _d) \right] + \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log P (w_{dn} \mid z_{dn}) \right] \\&\quad + \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right] - \sum _{k=1}^{K} E_{q} \left[ \log q (\beta _k \mid \lambda _k) \right] \\&\quad - \sum _{d=1}^{D} E_{q} \left[ \log q (\theta _d \mid \gamma _d) \right] \\&\quad - \sum _{d=1}^{D} \sum _{n=1}^{N_d} E_{q} \left[ \log q (z_{dn} \mid \phi _{dn}) \right] - \sum _{d=1}^{D} \sum _{m=1}^{M_d} E_{q} \left[ \log q \left( \tilde{z}_{dm} \mid \tilde{\phi }_{dm}\right) \right] \end{aligned}$$

where \(\tilde{w}^{(1)}_{dm}\) and \(\tilde{w}^{(2)}_{dm}\) denote the first and second words of biterm \(\tilde{w}_{dm}\), and \(\tilde{z}_{dm}\) is the topic assignment of \(\tilde{w}_{dm}\). The expansion of this equation is similar to [2]. The only difference lies in the term for the biterm \((\tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm})\):

$$\begin{aligned} E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm}, \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right]&= E_{q} \left[ \log P \left( \tilde{w}^{(1)}_{dm} \mid \tilde{z}_{dm}\right) \right] + E_{q} \left[ \log P \left( \tilde{w}^{(2)}_{dm} \mid \tilde{z}_{dm}\right) \right] \\&= \sum _{v=1}^V \sum _{k=1}^K I\{ \tilde{w}^{(1)}_{dm} = v \} \tilde{\phi }_{dmk} \left( \psi (\lambda _{kv}) - \psi \left( \sum _{u=1}^{V} \lambda _{ku} \right) \right) \\&\quad + \sum _{v=1}^V \sum _{k=1}^K I\{ \tilde{w}^{(2)}_{dm} = v \} \tilde{\phi }_{dmk} \left( \psi (\lambda _{kv}) - \psi \left( \sum _{u=1}^{V} \lambda _{ku} \right) \right) . \end{aligned}$$

We then maximize the lower bound with respect to each variational parameter in turn.

C.2 Variational parameter \(\lambda \)

Choose a topic index k. Fix \(\gamma , \phi \) and each \(\lambda _j \) for \( j \ne k\). The updated value of \(\lambda _k\) is computed by setting the derivative of \(L(\lambda _k)\) to zero:

$$\begin{aligned} \lambda _{k, v} \leftarrow \eta _{k, v} + \sum _{d=1}^D \sum _{n=1}^{N_d} I\left\{ w_{dn} = v \right\} \phi _{dnk} + \sum _{d=1}^D \sum _{m=1}^{M_d} \left[ I\left\{ \tilde{w}^{(1)}_{dm} = v \right\} + I\left\{ \tilde{w}^{(2)}_{dm} = v \right\} \right] \tilde{\phi }_{dmk}. \end{aligned}$$
(10)
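For concreteness, a minimal numpy sketch of the batch update in Eq. (10) follows; the data structures holding the word ids, the biterm word pairs, and the responsibilities \(\phi \) and \(\tilde{\phi }\) are illustrative assumptions.

```python
import numpy as np

def update_lambda(eta, docs_words, docs_biterms, phi, phi_tilde):
    """Batch update of the topic-word variational parameter lambda (Eq. 10).

    eta:          (K, V) prior on the topics.
    docs_words:   docs_words[d]   = list of word ids w_dn in document d.
    docs_biterms: docs_biterms[d] = list of (w1, w2) word-id pairs (biterms).
    phi:          phi[d]       = (N_d, K) array of word responsibilities.
    phi_tilde:    phi_tilde[d] = (M_d, K) array of biterm responsibilities.
    """
    lam = eta.copy()
    for d, words in enumerate(docs_words):
        for n, v in enumerate(words):
            lam[:, v] += phi[d][n]              # word term of Eq. (10)
        for m, (v1, v2) in enumerate(docs_biterms[d]):
            lam[:, v1] += phi_tilde[d][m]       # first word of the biterm
            lam[:, v2] += phi_tilde[d][m]       # second word of the biterm
    return lam
```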

C.3 Variational parameter \(\gamma \)

Choose a document d, and fix \(\lambda , \phi \) and each \(\gamma _c \) for \( c \ne d\). Proceeding as for \(\lambda \), we obtain:

$$\begin{aligned} \gamma _{d, k} = \alpha _{k} + \sum _{n=1}^{N_d} \phi _{dnk} + \sum _{m=1}^{M_d} \tilde{\phi }_{dmk}. \end{aligned}$$
(11)

C.4 Variational parameter \(\phi \)

Fix \(\lambda , \gamma , \tilde{\phi } \) and \( \phi _{cu} \) for \((c,u) \ne (d,n) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \phi _{dnk} = 1 \), we have:

$$\begin{aligned} \phi _{dnk} \propto \exp \left\{ E_{q} [\log \theta _{d, k}] + E_{q}\left[ \log \beta _{k,w_{dn}}\right] \right\} . \end{aligned}$$
(12)

C.5 Variational parameter \(\tilde{\phi }\)

Fix \(\lambda , \gamma , \phi \) and \( \tilde{\phi }_{cu} \) for \((c,u) \ne (d,m) \). Using the method of Lagrange multipliers with the constraint \(\sum _{k=1}^K \tilde{\phi }_{dmk} = 1 \), we obtain:

$$\begin{aligned} \tilde{\phi }_{d, m, k} \propto \exp \left\{ E_{q}[\log \theta _{d, k}] + E_{q}\left[ \log \beta _{k,\tilde{w}_{dm}^{(1)}}\right] + E_{q}\left[ \log \beta _{k,\tilde{w}_{dm}^{(2)}}\right] \right\} . \end{aligned}$$
(13)

Here, we denote \(\tilde{w}_{dm}^{(1)}\) and \(\tilde{w}_{dm}^{(2)}\) as the first word and second word of biterm \(\tilde{w}_{dm}\), respectively.
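Putting Eqs. (11)-(13) together, the following minimal sketch runs the per-document coordinate-ascent loop; the digamma-based expectations follow standard variational LDA, while the initialization of \(\gamma \) and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def local_step(words, biterms, lam, alpha, n_iters=50):
    """Coordinate ascent over gamma_d, phi_d and phi_tilde_d (Eqs. 11-13).

    words:   list of word ids w_dn in the document.
    biterms: list of (w1, w2) word-id pairs built from the document.
    lam:     (K, V) variational parameter of the topics.
    alpha:   (K,) Dirichlet prior on the topic proportions.
    """
    K = lam.shape[0]
    # E_q[log beta_{k,v}] = digamma(lambda_kv) - digamma(sum_u lambda_ku)
    e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.full(K, alpha.mean() + (len(words) + len(biterms)) / K)
    w1 = [b[0] for b in biterms]
    w2 = [b[1] for b in biterms]
    for _ in range(n_iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # Eq. (12): responsibilities of single words
        phi = np.exp(e_log_theta + e_log_beta[:, words].T)
        phi /= phi.sum(axis=1, keepdims=True)
        # Eq. (13): responsibilities of biterms (both words contribute)
        phi_t = np.exp(e_log_theta + e_log_beta[:, w1].T + e_log_beta[:, w2].T)
        phi_t /= phi_t.sum(axis=1, keepdims=True)
        # Eq. (11): update the topic-proportion parameter
        gamma = alpha + phi.sum(axis=0) + phi_t.sum(axis=0)
    return gamma, phi, phi_t
```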


About this article


Cite this article

Tuan, A.P., Tran, B., Nguyen, T.H. et al. Bag of biterms modeling for short texts. Knowl Inf Syst 62, 4055–4090 (2020). https://doi.org/10.1007/s10115-020-01482-z

