Sentiment analysis on big sparse data streams with limited labels

Iosifidis, Vasileios; Ntoutsi, Eirini

doi:10.1007/s10115-019-01392-9

Regular Paper
Published: 17 August 2019

Knowledge and Information Systems, volume 62, pages 1393–1432 (2020)

Abstract

Sentiment analysis is an important task for gaining insights from the huge amounts of opinionated text generated daily on social media such as Twitter. Despite the abundance of such data, standard supervised learning methods cannot be applied to it directly, because labels are lacking and human labeling is impractical at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015, which consists of 228 million tweets without retweets (275 million with retweets). We present insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation–Maximization. Moreover, we propose two annotation modes: a batch mode, in which all labeled and unlabeled data are available to the algorithms from the beginning, and a lightweight streaming mode, which processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves results comparable to batch processing while being more efficient. Finally, our dataset is imbalanced toward the positive sentiment class, and the semi-supervised learning methods aggravate this imbalance; to tackle the problem, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation significantly outperforms the default semi-supervised annotation process. We make the resulting sentiment-annotated dataset, TSentiment15, available to the community for evaluation purposes and for developing new methods.
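
As a rough illustration of the Self-Learning idea named in the abstract, the sketch below trains a classifier on the labeled tweets, moves unlabeled tweets whose prediction is confident enough into the training set, and repeats. It is not the authors' implementation: the multinomial Naive Bayes base learner, the whitespace tokenization, the confidence threshold and all function names are assumptions made only for this example.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a tiny multinomial Naive Bayes model (assumed base learner)."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, doc_counts, vocab


def predict_proba(model, doc):
    """Return {class: posterior probability} for one document."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def self_learning(labeled_docs, labels, unlabeled_docs, threshold=0.9, max_rounds=10):
    """Iteratively add confidently predicted unlabeled documents to the training set."""
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    for _ in range(max_rounds):
        model = train_nb(docs, labs)
        remaining = []
        for doc in pool:
            proba = predict_proba(model, doc)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                docs.append(doc)
                labs.append(best)
            else:
                remaining.append(doc)
        if len(remaining) == len(pool):  # nothing new was labeled; stop
            break
        pool = remaining
    return train_nb(docs, labs), docs, labs, pool
```

Documents that never reach the threshold stay in the returned pool, so a caller can inspect how much of the stream remains unannotated. A lower threshold labels more data at the cost of noisier labels.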

Notes

The Twitter crawling collection project is part of the L3S research center initiative.
https://dev.twitter.com/streaming/overview.
www.noslang.com.
https://wordnet.princeton.edu/.
http://weka.sourceforge.net/doc.dev/weka/core/stopwords/Rainbow.html.
http://sentiwordnet.isti.cnr.it/.
Positive emoticons: c) =] :] :} ;> :>) :^) :D =) ;) :) 8) (: (; :o) :-) :P <3 :3 ^_^
Negative emoticons: -c :[ :{ :< :-( :/ :-[ :c :-< :( :'{ >:[
http://wordnetcode.princeton.edu/glosstag.shtml.
Source code and data are available at: https://iosifidisvasileios.github.io/Semi-Supervised-Learning/.
http://www.crowdflower.com/.
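
The emoticon lists in the notes above suggest how distant supervision can assign noisy sentiment labels without human annotators. The following is a minimal sketch of that idea, not the authors' pipeline: the two sets transcribe the listed emoticons as best they can be read, while the whitespace tokenization, the rule of discarding tweets with mixed or no emoticons, and the function name `emoticon_label` are assumptions for illustration.

```python
# Emoticon sets transcribed from the notes (transcription is best-effort).
POSITIVE = {
    "c)", "=]", ":]", ":}", ";>", ":>)", ":^)", ":D", "=)", ";)",
    ":)", "8)", "(:", "(;", ":o)", ":-)", ":P", "<3", ":3", "^_^",
}
NEGATIVE = {
    "-c", ":[", ":{", ":<", ":-(", ":/", ":-[", ":c", ":-<", ":(",
    ":'{", ">:[",
}


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None when the signal is absent or mixed."""
    # Assumption: emoticons appear as whitespace-separated tokens, so the
    # ':/' inside a URL such as 'http://...' is not counted as an emoticon.
    tokens = set(tweet.split())
    has_pos = bool(tokens & POSITIVE)
    has_neg = bool(tokens & NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon, or conflicting emoticons


tweets = ["love this :)", "so tired :(", "mixed :) :(", "see http://example.com"]
labels = [emoticon_label(t) for t in tweets]
```

Tweets that receive `None` would remain unlabeled and could then be handed to a semi-supervised method.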

Acknowledgements

The work was inspired by the German Research Foundation (DFG) Project (Grant No. 317686254) OSCAR (Opinion Stream Classification with Ensembles and Active leaRners) for which the last author is a Principal Investigator.

Author information

Authors and Affiliations

L3S Research Center, Leibniz University Hanover, Hanover, Germany
Vasileios Iosifidis & Eirini Ntoutsi

Corresponding author

Correspondence to Vasileios Iosifidis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Iosifidis, V., Ntoutsi, E. Sentiment analysis on big sparse data streams with limited labels. Knowl Inf Syst 62, 1393–1432 (2020). https://doi.org/10.1007/s10115-019-01392-9

Received: 12 November 2018
Revised: 29 July 2019
Accepted: 05 August 2019
Published: 17 August 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s10115-019-01392-9
