Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform

  • Original Paper
  • Statistical Methods & Applications

Abstract

This paper proposes a new aggregated classification scheme aimed at supporting semantic text analysis in contexts characterized by rare text categories. The proposed approach starts from the aggregate supervised text classifier developed by Hopkins and King and extends it with rare event sampling methods. Specifically, it enables the analyst to enlarge the number of estimated sentiment categories while preserving estimation accuracy and reducing the working time that unconditionally increasing the size of the training set would require. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events held in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis portrays the evolution of the Expo stakeholders’ opinions over time and identifies the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe.

References

  • Agosti M, Bacchin M, Ferro N, Melucci M (2002) Improving the automatic retrieval of text documents. In: Workshop of the cross-language evaluation forum for European Languages. Springer, pp 279–290

  • Aprosio AP, Moretti G (2016) Italy goes to stanford: a collection of corenlp modules for italian. arXiv preprint arXiv:1609.06204

  • Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22(1):39–71

  • Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84

  • Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35

  • Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

  • Bouchet-Valat M (2014) SnowballC: snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1

  • Breiman L, Friedman J, Stone C, Olshen R (1984) Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series. Chapman & Hall, New York

  • Breslow NE (1996) Statistics in epidemiology: the case–control study. J Am Stat Assoc 91(433):14–28

  • Ceron A, Curini L, Iacus SM (2015) Using social media to forecast electoral results: a review of state-of-the-art. Stat Appl Ital J Appl Stat 25(3):239–261

  • Ceron A, Curini L, Iacus SM (2016) isa: a fast, scalable and accurate algorithm for sentiment analysis of social media content. Inf Sci 367:105–124

  • Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big impact. MIS Q 36(4):1165–1188

  • Choi D, Kim P (2013) Sentiment analysis for tracking breaking events: a case study on twitter. Asian conference on intelligent information and database systems. Springer, Berlin, pp 285–294

  • Corallo A, Fortunato L, Matera M, Alessi M, Camillò A, Chetta V, Giangreco E, Storelli D (2015) Sentiment analysis for government: an optimized approach. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer, Cham, pp 98–112

  • da Silva NF, Hruschka ER, Hruschka ER (2014) Tweet sentiment analysis with classifier ensembles. Decis Support Syst 66:170–179

  • Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manag Sci 53(9):1375–1388

  • Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th international conference on World Wide Web. ACM, New York, WWW ’03, pp 519–528

  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  • Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York

  • Erosheva E, Fienberg S, Lafferty J (2004) Mixed-membership models of scientific publications. Proc Natl Acad Sci 101(suppl 1):5220–5227

  • ExpoMilano (2015) Expo Milano 2015: La sfida dell’italia per un’esplosione universale innovativa. www.expo2015.org

  • Feinerer I, Hornik K (2017) tm: Text Mining Package. R package version 0.7-3

  • Gentry J (2015) twitteR: R based Twitter Client. R package version 1.1.9

  • Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. Nature 1(12):1–6

  • Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297

  • Hand DJ (2006) Classifier technology and the illusion of progress. Stat Sci 21(1):1–14

  • Hopkins DJ, King G (2010) A method of automated nonparametric content analysis for social science. Am J Polit Sci 54(1):229–247

  • Hopkins D, King G (2017) ReadMe: software for automated content analysis. R package version 0.99837

  • Inversini A, Marchiori E, Dedekind C, Cantoni L (2010) Applying a conceptual framework to analyze online reputation of tourism destinations. In: Gretzel U, Law R, Fuchs M (eds) Information and communication technologies in tourism 2010. Springer Vienna, Vienna, pp 321–332

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 137–142

  • King G, Zeng L (2001) Logistic regression in rare events data. Polit Anal 9(2):137–163

  • Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. Am Polit Sci Rev 97(2):311–331

  • Liaw A, Wiener M (2015) Classification and regression by randomforest. R Cran Repository R package version 4.6-12

  • Lowe W (2008) Understanding wordscores. Polit Anal 16(4):356–371

  • Mahalakshmi S, Sivasankar E (2015) Cross domain sentiment analysis using different machine learning techniques. In: Ravi V, Panigrahi BK, Das S, Suganthan PN (eds) Proceedings of the fifth international conference on fuzzy and neuro computing. Springer, Cham, FANCCO-2015, pp 77–87

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  • Martin LW, Vanberg G (2008) A robust transformation procedure for interpreting political text. Polit Anal 16(1):93–100

  • Monroe BL, Maeda K (2004) Talk’s cheap: text-based estimation of rhetorical ideal-points. In: 21st annual meeting of the Society for Political Methodology, pp 29–31

  • Mudinas A, Zhang D, Levene M (2012) Combining lexicon and learning based approaches for concept-level sentiment analysis. In: Proceedings of the first international workshop on issues of sentiment discovery and opinion mining. ACM, New York, WISDOM ’12, pp 1–8

  • Mukherjee S, Bhattacharyya P (2013) Sentiment analysis : a literature survey. arXiv preprint arXiv:1304.4520

  • Müller M (2015) What makes an event a mega-event? Definitions and sizes. Leis Stud 34(6):627–642

  • Nirmala CR, Roopa GM, Kumar KRN (2015) Twitter data analysis for unemployment crisis. In: 2015 international conference on applied and theoretical computing and communication technology. Davanagere, Karnataka, India. iCATccT, pp 420–423

  • Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135

  • Pang B, Lee L, Vaithyanathan S (2002) Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing, vol 10. Association for Computational Linguistics, Stroudsburg, EMNLP ’02, pp 79–86

  • Ponzi LJ, Fombrun CJ, Gardberg NA (2011) Reptrak™ pulse: conceptualizing and validating a short-form measure of corporate reputation. Corp Reput Rev 14(1):15–35

  • Rao Y, Lei J, Wenyin L, Li Q, Chen M (2014a) Building emotional dictionary for sentiment analysis of online news. World Wide Web 17(4):723–742

  • Rao Y, Li Q, Mao X, Wenyin L (2014b) Sentiment topic models for social emotion mining. Inf Sci 266:90–100

  • Rayner J (2004) Managing reputational risk: curbing threats, leveraging opportunities. Wiley, New York

  • Ribeiro FN, Araújo M, Gonçalves P, André Gonçalves M, Benevenuto F (2016) Sentibench—a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci 5(1):23

  • Roberts ME, Stewart BM, Airoldi EM (2016) A model of text for experimentation in the social sciences. J Am Stat Assoc 111(515):988–1003

  • Salter-Townshend M, Murphy TB (2014) Mixtures of biased sentiment analysers. Adv Data Anal Classif 8(1):85–103

  • Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. Am J Polit Sci 52(3):705–722

  • Solari D, Sciandra A, Rinaldo M, Redaelli M, Finos L (2016) TextWiller: collection of functions for text mining, specially devoted to the Italian language. https://github.com/livioivil/TextWiller

  • Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

  • Stone PJ, Dunphy DC, Smith MS, Ogilvie DM (1968) The general inquirer: a computer approach to content analysis. Am J Sociol 73(5):634–635

  • Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2011) Lexicon-based methods for sentiment analysis. Comput Ling 37(2):267–307

  • Tian F, Wu F, Chao KM, Zheng Q, Shah N, Lan T, Yue J (2016) A topic sentence-based instance transfer method for imbalanced sentiment classification of chinese product reviews. Electron Commerce Res Appl 16:66–76

  • Tripathy A, Agrawal A, Rath SK (2016) Classification of sentiment reviews using n-gram machine learning approach. Expert Syst Appl 57:117–126

  • Zhao H, Ji X, Zeng Q, Jiang S (2016) A teaching evaluation method based on sentiment classification. Int J Comput Sci Math 7(1):54–62

  • Zhou Z, Zhang X, Sanderson M (2014) Sentiment analysis on twitter through topic-based lexicon expansion. In: Wang H, Sharaf MA (eds) Databases theory and applications. Springer, Cham, pp 98–109

Author information

Corresponding author

Correspondence to Anna Calissano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Proof of Theorem 1.

Proof

We prove Theorem 1 for a general number of stems K and a general number of categories J. Let S be a multinomial variable taking the values \(S_1,\ldots ,S_{2^K}\) and let D be a multinomial variable taking the values \(D_1,\ldots ,D_J\). By definition, A is a \(2^K\times 2^K\) diagonal matrix and B is a \(J\times J\) diagonal matrix. In matrix terms, Eq. (8) can be rewritten as follows:

$$\begin{aligned}&\left[ \begin{array}{ccc} A_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} A_{2^K 2^K} \end{array}\right] \left[ \begin{array}{ccc} P^{RTr}(S=S_1|D=D_1) &{} \dots &{} P^{RTr}(S=S_1|D=D_J) \\ P^{RTr}(S=S_2|D=D_1) &{} \dots &{} P^{RTr}(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P^{RTr}(S=S_{2^K}|D=D_1) &{} \dots &{} P^{RTr}(S=S_{2^K}|D=D_J) \end{array}\right] \left[ \begin{array}{ccc} B_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} B_{J J} \end{array}\right] \\&\quad = \left[ \begin{array}{ccc} P(S=S_1|D=D_1) &{} \dots &{} P(S=S_1|D=D_J) \\ P(S=S_2|D=D_1) &{} \dots &{} P(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P(S=S_{2^K}|D=D_1) &{} \dots &{} P(S=S_{2^K}|D=D_J) \end{array}\right] \\ \end{aligned}$$

Because A and B are diagonal, the equality can be proved component-wise by considering the general component (ij):

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$

For the sake of simplicity, we omit the (ij) subscripts in the rest of the proof.
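
Theorem 1 itself is not reproduced in this appendix. As a reading aid, the substitutions made in the steps that follow are consistent with the diagonal entries below; this is inferred from the derivation and should be read as an assumption about the statement of the theorem, not as a quotation of it:

$$\begin{aligned} A_{ii}=\dfrac{P^{Tr}(S=S_i)}{P^{RTr}(S=S_i)},\qquad B_{jj}=\dfrac{1}{\sum \limits _{n=1}^{2^K}P^{RTr}(S=S_n|D=D_j)A_{nn}} \end{aligned}$$

Under this reading, A reweights each stem profile from the resampled training set back to the original one, and B renormalizes each column so that it sums to one.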

Consider the left-hand side of the equality, substitute \(A_{ii}\), and apply Bayes' formula:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(S=S_i|D=D_j)B_{jj}}{P^{RTr}(S=S_i)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)P^{RTr}(S=S_i)B_{jj}}{P^{RTr}(S=S_i)P^{RTr}(D=D_j)} \end{aligned}$$

Substituting \(B_{jj}\):

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{P^{RTr}(S=S_n|D=D_j)A_{nn}}}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{\dfrac{P^{RTr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}{P^{RTr}(D=D_j)}}} \end{aligned}$$

Using the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) and the law of total probability:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{\sum \nolimits _{n=1}^{2^K}P^{Tr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{P^{Tr}(D=D_j)}\\&\quad =P^{Tr}(S=S_i|D=D_j) \end{aligned}$$

By hypothesis:

$$\begin{aligned} P^{Tr}(S=S_i|D=D_j)=P(S=S_i|D=D_j) \end{aligned}$$

Hence the component-wise equality holds:

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P^{Tr}(S=S_i|D=D_j)]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$
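
The identity above can also be checked numerically. The following is a minimal sketch, not the authors' implementation: it draws random distributions in which the resampled training set shares \(P(D|S)\) with the original training set but has a different marginal \(P(S)\), then applies the correction with \(A_{ii}=P^{Tr}(S_i)/P^{RTr}(S_i)\) and \(B_{jj}\) as the column normalizer, as read off from the substitutions in the proof. The function name, the default sizes, and the random setup are illustrative assumptions.

```python
import random


def normalize(values):
    """Scale a list of positive numbers so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]


def max_correction_error(n_profiles=8, n_categories=3, seed=0):
    """Largest absolute gap between A * P^RTr(S|D) * B and P^Tr(S|D).

    n_profiles plays the role of 2^K (here K = 3), n_categories of J.
    All distributions are synthetic and only serve as an illustration.
    """
    rng = random.Random(seed)
    # Marginals of S in the original (Tr) and resampled (RTr) training sets.
    p_tr_s = normalize([rng.random() + 0.1 for _ in range(n_profiles)])
    p_rtr_s = normalize([rng.random() + 0.1 for _ in range(n_profiles)])
    # Hypothesis of the proof: P(D|S) is the same in both sets.
    p_d_given_s = [
        normalize([rng.random() + 0.1 for _ in range(n_categories)])
        for _ in range(n_profiles)
    ]

    def s_given_d(p_s):
        # Bayes' formula: P(S_i|D_j) = P(D_j|S_i) P(S_i) / P(D_j).
        p_d = [
            sum(p_d_given_s[i][j] * p_s[i] for i in range(n_profiles))
            for j in range(n_categories)
        ]
        return [
            [p_d_given_s[i][j] * p_s[i] / p_d[j] for j in range(n_categories)]
            for i in range(n_profiles)
        ]

    target = s_given_d(p_tr_s)      # P^Tr(S|D)
    resampled = s_given_d(p_rtr_s)  # P^RTr(S|D)

    # Diagonal entries of A and B as used in the proof's substitutions.
    a = [p_tr_s[i] / p_rtr_s[i] for i in range(n_profiles)]
    b = [
        1.0 / sum(resampled[n][j] * a[n] for n in range(n_profiles))
        for j in range(n_categories)
    ]

    return max(
        abs(a[i] * resampled[i][j] * b[j] - target[i][j])
        for i in range(n_profiles)
        for j in range(n_categories)
    )
```

Calling `max_correction_error()` with different seeds should return values at the level of floating-point rounding, which is what the component-wise equality predicts when the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) holds exactly.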

Appendix 2

List of the opinion and sentiment categories defined for the analysis of the web reputation of Expo Milano (Tables 1 and 2).

Table 1 Sentiment analysis: description of the sentiment categories
Table 2 Opinion analysis: description of the positive (negative) opinion categories. Every positive category has a corresponding negative one. Neutral, Off-Topics, and Advertise categories are also estimated in opinion analysis

About this article

Cite this article

Calissano, A., Vantini, S. & Arena, M. Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform. Stat Methods Appl 29, 787–812 (2020). https://doi.org/10.1007/s10260-019-00504-7
