Abstract
The problem of tweet popularity prediction, that is, forecasting the total number of retweets stemming from an ancestral tweet, has attracted considerable interest recently. The prediction can be accomplished by fitting a point process model to the sequence of retweet times up to a certain censoring time and projecting the fitted model forward to a future time point. However, models employing such an approach tend to predict poorly when the censoring time is too short for sufficient information to accumulate. To overcome this, we propose an empirical Bayes type approach to parameter estimation that combines internal knowledge, namely the times of the retweets observed up to the censoring time, with external knowledge, namely the complete retweet sequences in the training data. We demonstrate the approach using several point process models with finite-dimensional parameters, where the prior distribution for the parameter of each model is constructed from the external knowledge and the likelihood is calculated from the internal knowledge. The mode of the posterior distribution is used as the estimator of the finite-dimensional parameter, and the mean of the predictive distribution for the number of retweets implied by each estimated model is used to predict the tweet popularity. Using a large Twitter data set, we show that the proposed methodology not only enables prediction at time zero, before the arrival of any retweet event, but also substantially improves the prediction performance of existing models, especially at earlier censoring times.
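The estimation scheme described in the abstract can be illustrated with a deliberately simple sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the retweet intensity is taken to be a*b*exp(-b*t) (an inhomogeneous Poisson process whose expected total count is a), the prior is an independent Gaussian on (log a, log b) moment-matched to hypothetical parameter estimates from complete training cascades, a coarse grid search stands in for a proper optimizer, and the predictive mean is approximated by plugging the posterior mode into the model. The function names are likewise hypothetical.

```python
import math


def fit_prior(training_params):
    """Moment-match a Gaussian prior on (log a, log b) to training estimates.

    training_params: (a, b) pairs estimated from complete retweet sequences
    (the external knowledge). Returns ((mean, sd), (mean, sd)).
    """
    def mean_sd(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, math.sqrt(var)

    log_a = [math.log(a) for a, _ in training_params]
    log_b = [math.log(b) for _, b in training_params]
    return mean_sd(log_a), mean_sd(log_b)


def log_posterior(log_a, log_b, n, sum_t, T, prior):
    """Unnormalized log-posterior: Poisson-process log-likelihood plus log-prior."""
    a, b = math.exp(log_a), math.exp(log_b)
    # Sum of log-intensities at the n observed events, minus the
    # integrated intensity over the observation window [0, T].
    log_lik = n * (log_a + log_b) - b * sum_t - a * (1.0 - math.exp(-b * T))
    (m_a, s_a), (m_b, s_b) = prior
    log_prior = (-0.5 * ((log_a - m_a) / s_a) ** 2
                 - 0.5 * ((log_b - m_b) / s_b) ** 2)
    return log_lik + log_prior


def map_estimate(times, T, prior, steps=200, width=4.0):
    """Posterior mode of (a, b) by grid search within `width` prior sds."""
    n, sum_t = len(times), sum(times)
    (m_a, s_a), (m_b, s_b) = prior
    offsets = [width * (2.0 * i / steps - 1.0) for i in range(steps + 1)]
    best_val, best = -math.inf, (m_a, m_b)
    for u in offsets:
        for v in offsets:
            log_a, log_b = m_a + u * s_a, m_b + v * s_b
            val = log_posterior(log_a, log_b, n, sum_t, T, prior)
            if val > best_val:
                best_val, best = val, (log_a, log_b)
    return math.exp(best[0]), math.exp(best[1])


def predict_total(times, T, prior):
    """Observed count plus the expected future count under the fitted model."""
    a, b = map_estimate(times, T, prior)
    return len(times) + a * math.exp(-b * T)


# Hypothetical training estimates and a short observed cascade.
prior = fit_prior([(50.0, 0.5), (100.0, 1.0), (200.0, 2.0)])
print(predict_total([], 0.0, prior))               # time zero: prior only
print(predict_total([0.1, 0.3, 0.4], 0.5, prior))  # three early retweets
```

With no retweets and a censoring time of zero, the likelihood in this sketch is constant, so the posterior mode coincides with the prior mode and a prediction is still available. This mirrors the time-zero property claimed in the abstract, although the paper's actual models and prior construction may differ.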
Data Availability Statement
Data are available from http://snap.stanford.edu/seismic/
Acknowledgements
The authors gratefully acknowledge the constructive comments from the reviewers, which have led to improved presentation. This research includes computations using the computational cluster Katana supported by Research Technology Services at UNSW Sydney. The research also benefited from the assistance of resources from the National Computational Infrastructure (NCI), supported by the Australian Government.
Funding
Tan was supported by UMK Fundamental Research Grant [R/FUND/A0100/01348A/001/2020/00840]. Chen was partly supported by UNSW Science Faculty Research Grant [PS35307].
Ethics declarations
Conflicts of interest
Not applicable.
Code availability
Code is available from the authors upon request.
Appendix
See Fig. 5.
Cite this article
Tan, W.H., Chen, F. Predicting the popularity of tweets using internal and external knowledge: an empirical Bayes type approach. AStA Adv Stat Anal 105, 335–352 (2021). https://doi.org/10.1007/s10182-021-00390-z
DOI: https://doi.org/10.1007/s10182-021-00390-z