1 Introduction

Traditionally, in domains such as market research, user subjectivity has been accessed using qualitative techniques such as surveys, interviews and focus groups. The proliferation of user-generated content on the Web 2.0 provides new opportunities for capturing people's appraisals, feelings and opinions. However, the sheer scale of text data generated on the Web poses obvious practical challenges for classifying user subjectivity using traditional qualitative techniques. Text mining has emerged as a potential solution for overcoming the information overload associated with reading vast amounts of text from diverse sources. In particular, sentiment analysis aims to automatically extract and classify sentiment (the subjective component of an opinion) and/or emotions (the projections or display of a feeling) expressed in text (Liu 2010; Munezero et al. 2014).

Research in this domain has focused on the problem of sentiment analysis by classifying opinionated text segments (e.g. phrase, sentence or paragraph) in terms of positive or negative polarity, e.g. (Aue and Gamon 2005; Bethard et al. 2004; Breck et al. 2007). The problem with sentiment polarity is that it combines diverse emotions into just two classes. For example, negative polarity conflates sadness, fear and anger. Some domains require further differentiation to associate specific emotions with appropriate actions. For example, in monitoring counter-terrorism issues, sadness, fear and anger may each require a different targeted response, e.g. counseling, media communication and anti-radicalization respectively. The problem of classifying public reaction to terrorist activities on social media thus indicates a need for specific emotion classification rather than sentiment polarity.

In this study, we compared six emotion classification schemes: six basic emotions (Ekman 1971), the wheel of emotion (Plutchik 1980), the Circumplex theory of affect (Watson and Tellegen 1985), EARL (HUMAINE (Human-Machine Interaction Network on Emotion) 2006), WordNet–Affect (Valitutti et al. 2004) and a free text classification scheme. To measure their utility, we investigated their ease of use by human annotators as well as the performance of supervised machine learning when such schemes are used to annotate the training data. The study was designed as follows: (1) select a representative set of emotion classification schemes, (2) assemble a corpus of emotionally charged text documents, (3) use a crowdsourcing approach to manually annotate these documents under each scheme, (4) compare the schemes using inter-annotator agreement, and interview a selected group of annotators about their perceptions of each scheme, and (5) compare the classification performance of supervised machine learning algorithms trained on the data annotated under each scheme.

2 Emotion Classification Schemes

The main tension in the literature is whether emotions can be defined as discrete, universal categories of basic emotions, whether they are characterized by one or more dimensions, or whether they are organized hierarchically. Here we discuss five examples of classification schemes, which are summarized in Table 1.

Table 1 A sample of emotion classification schemes

Categorical approaches are usually theory-driven accounts that suggest basic emotions are the functional expression of underlying biological and evolutionary processes (Damasio 2000; Darwin et al. 1998; LeDoux 1998). This view is supported by empirical findings of cross-cultural studies where recognition of facial expressions identified six basic emotions: anger, disgust, fear, happiness, sadness and surprise (Ekman 1971). Basic emotions provide a simple classification scheme, which has been used in many studies on emotive language analysis, e.g. (Aman and Szpakowicz 2007; Das and Bandyopadhyay 2012; Mohammad 2012; Strapparava and Mihalcea 2008).

The wheel of emotion (Plutchik 1980) (Fig. 1) is a model that uses color to illustrate the intensity of emotions and their relationships. At the center of this model are eight basic emotions: joy, trust, fear, surprise, sadness, disgust, anger and anticipation. The emotion space is arranged so that combinations of basic emotions yield secondary emotions (e.g. joy + trust = love, anger + anticipation = aggression, etc.). Emotion intensity is represented by color boldness, e.g. annoyance is less intense than anger, whereas rage is more intense.

Fig. 1 Wheel of emotion (Plutchik 1980)

Dimensional approaches represent emotions as coordinates in a multi-dimensional space (Cambria et al. 2012). There is considerable variation among these models, many of which are formed by two or three dimensions (Rubin and Talarico 2009), which incorporate aspects of arousal and valence (e.g. (Russell 1979)), evaluation and activation (e.g. (Whissell 1989)), positive and negative (e.g. (Watson and Tellegen 1985)), tension and energy (e.g. (Thayer 1997)), etc. When faceted information is needed for tasks such as emotive language analysis, dimensional models are appealing because they contain a relatively small set of categories, e.g. (Ovesdotter Alm and Sproat 2005; Strapparava and Mihalcea 2008; Cambria et al. 2012). Circumplex theory of affect (Watson and Tellegen 1985) (Fig. 2) incorporates four dimensions corresponding to positive affect, engagement, negative affect and pleasantness, each having two directions: high and low. Specific emotions are classified into one of eight categories on this scale. For example, excitement is classified as having high positive affect, calmness as having low negative affect, etc. Circumplex has been suggested as a useful model for quantifying and qualitatively describing emotions identified in text (Rubin et al. 2004).

Fig. 2 Circumplex theory of affect (Watson and Tellegen 1985)

EARL is a formal language for representing emotions in technological contexts (HUMAINE (Human-Machine Interaction Network on Emotion) 2006). Unlike schemes derived from psychological theory, EARL has been designed for a wide range of tasks in affective computing, including corpus annotation and emotion recognition. Similarly to Circumplex, EARL organizes emotions as primarily positive and negative, which are further refined based on intensity and attitude. There are five positive and five negative categories, and like Circumplex, specific emotions are given as representative examples for each category. For example, agitation is exemplified by shock, stress and tension (see Table 2).

Table 2 Emotion annotation representation language

The capacity of human cognition depends on the type and quantity of information stored in working memory. For instance, memory span is generally shorter for longer words and longer for shorter ones (Miller 1956). Humans often use hierarchical approaches to navigate a complex conceptual space by compartmentalizing options at different levels. Unlike other schemes that contain a small but manageable set of categories, affective hierarchies, e.g. (Laros and Steenkamp 2005; Shaver et al. 1987; Storm and Storm 1987), capture a richer set of emotions, focusing on lexical aspects that can support text mining applications. In affective hierarchies, related emotions are grouped into classes, starting with positive and negative affect as the top-level classes. Basic emotions (e.g. happiness, sadness, love, anger, etc.) are typically found at the next level of specialization. The lowest level represents instances of individual emotions (e.g. optimistic, miserable, passionate) (Russell and Barrett 1999).

WordNet is a lexical database of English nouns, verbs, adjectives and adverbs grouped together into sets of interlinked synonyms known as synsets (Miller 1995). WordNet has been used as a lexical resource for many text mining applications, e.g. (Agarwal et al. 2011; Fast et al. 2015; Sedding and Kazakov 2004). WordNet–Affect (Valitutti et al. 2004) was created specifically as a lexical model for classifying affects, such as moods, situational emotions, or emotional responses, either directly (e.g. joy, sad, happy, etc.) or indirectly (e.g. pleasure, hurt, sorry, etc.). It was formed by aggregating a subset of WordNet synsets into an affect hierarchy (see Fig. 3). WordNet–Affect has been used as a lexical resource to support many sentiment analysis studies, e.g. (Balahur et al. 2010; Strapparava and Mihalcea 2008).

Fig. 3 An excerpt from the WordNet–Affect hierarchy

3 Data Collection

3.1 Text Corpus

Emotive language analysis has been applied to a range of texts from different domains. Studies have focused on emotions expressed in web logs, e.g. (Généreux and Evans 2006; Mihalcea and Liu 2006; Neviarouskaya et al. 2009), fairy tales, e.g. (Ovesdotter Alm and Sproat 2005; Francisco and Gervás 2006), novels, e.g. (Boucouvalas 2002; John et al. 2006), chat messages, e.g. (Zhe and Boucouvalas 2002; Ma et al. 2005), e-mails, e.g. (Liu et al. 2003), Twitter posts, e.g. (Tumasjan et al. 2010; Agarwal et al. 2011), etc. Twitter is a social networking service that enables users to send and read tweets – text messages consisting of up to 140 characters. Twitter provides an open platform for users from diverse demographic groups. An estimated 500 million tweets are posted each day (Haustein et al. 2016). The content of tweets ranges from daily life updates and shared content (e.g. news, music, articles, etc.) to expressions of opinion. The use of Twitter as a means of self-disclosure makes it a valuable source of emotionally-charged text and a popular choice for sentiment analysis studies, e.g. (Go et al. 2009; Pak and Paroubek 2010; Kouloumpis et al. 2011). For these reasons, Twitter was selected as a source of data in the present study.

We assembled a corpus of 500 self-contained tweets, i.e. those that did not appear to be part of a conversation. More precisely, we excluded re-tweets, replies and tweets that contained URLs or mentioned other users, to maximize the likelihood that an emotion expressed in a tweet refers to the tweet itself and not to an external source (e.g. content corresponding to a URL). We used four criteria to identify emotionally-charged tweets, based on the use of idioms, emoticons and hashtags as well as automatically calculated sentiment polarity. The remainder of this section describes the selection criteria in more detail.
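
The filtering step can be expressed as a few regular-expression checks. The sketch below is a minimal illustration of the exclusion criteria described above, assuming each tweet is available as raw text together with its retweet/reply metadata; it is not the exact code used in the study.

```python
import re

def is_self_contained(text: str, is_retweet: bool, is_reply: bool) -> bool:
    """Keep only tweets whose emotional content is likely self-contained:
    no re-tweets, replies, URLs or mentions of other users."""
    if is_retweet or is_reply:
        return False
    if re.search(r'https?://\S+', text):      # refers to external content
        return False
    if re.search(r'(?<!\w)@\w+', text):       # mentions another user
        return False
    return True

# Only the first tweet passes the filter.
print(is_self_contained("If I see a mouse in this house I will go ballistic.",
                        is_retweet=False, is_reply=False))            # True
print(is_self_contained("@friend look https://t.co/x", False, True))  # False
```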

In previous studies, we demonstrated the value of idioms as pertinent features in sentiment analysis (Williams et al. 2015; Spasić et al. 2017), where idiom-based features significantly improved sentiment classification results. Using the set of emotionally-charged idioms described in the original study, we collected 100 tweets containing such idioms. The following is an example of a tweet with the idiom presented in italic typeface: "If I see a mouse in this house I will go ballistic."

Written online communication has led to the emergence of informal, sometimes ungrammatical, textual conventions (Purver and Battersby 2012) used to compensate for the absence of body language and intonation, which are estimated to carry 93% of the emotional content of face-to-face communication (Mehrabian 1972). Emoticons are pictorial representations of facial expressions that seem to compensate for the lack of embodied communication. For example, the smiley face :) is commonly used to represent positive emotions. We collected 100 tweets containing emoticons. Table 3 summarizes the distribution of emoticons across these tweets.

Table 3 Distribution of emoticons across 100 tweets

Hashtags, i.e. words or unspaced phrases prefixed with a pound sign (#), are commonly used by Twitter users to add context and metadata to the main content of a tweet, which in turn makes it easier for other users to find messages on a specific topic (Chang 2010). Hashtags are sometimes used to flag the user's emotional state (Wang et al. 2011), as in the following example: "Sometimes I just wonder. . . I don't know what to think #pensive". To systematically search Twitter for emotive hashtags, we used WordNet–Affect as a comprehensive lexicon of emotive words. Our local version of the lexicon consists of 1,484 words, including all derivational and inflectional forms of the word senses originally found in WordNet–Affect. We searched Twitter using these surface forms as hashtags to collect 100 tweets. The hashtags were subsequently removed from the original tweets for two reasons. First, we wanted the annotators to infer the emotion themselves from the main content. Second, we did not want to skew the inter-annotator agreement in favor of WordNet–Affect as a classification scheme.
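
As an illustration, hashtag-based selection and removal can be sketched as follows. The lexicon below is a hypothetical three-word stand-in for the 1,484 WordNet–Affect surface forms, which are not reproduced here.

```python
import re

# Hypothetical stand-in for the 1,484 WordNet–Affect surface forms.
EMOTIVE_FORMS = {"pensive", "joyful", "furious"}

def emotive_hashtags(tweet: str) -> list:
    """Return lexicon words used as hashtags in a tweet."""
    return [t for t in re.findall(r'#(\w+)', tweet.lower())
            if t in EMOTIVE_FORMS]

def strip_hashtags(tweet: str) -> str:
    """Remove hashtags so annotators must infer the emotion from the text."""
    return re.sub(r'\s*#\w+', '', tweet).strip()

tweet = "Sometimes I just wonder... I don't know what to think #pensive"
print(emotive_hashtags(tweet))   # ['pensive']
print(strip_hashtags(tweet))     # hashtag removed from the text
```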

Another strategy for identifying emotionally-charged tweets involved automatically calculated sentiment polarity. We collected 116,903 tweets at random and processed them with a sentiment annotator distributed as part of Stanford CoreNLP (Socher et al. 2013), a suite of natural language processing tools. This method uses recursive neural networks to perform sentiment analysis at all levels of compositionality across the parse tree, classifying each subtree on a 5-point scale: very negative, negative, neutral, positive and very positive. Figure 4 provides an example classified as very negative. We used the sentiment analysis results to select a random subset of 50 very positive and 50 very negative tweets.

Fig. 4 An example of sentiment analysis results
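
For readers wishing to reproduce this step, the sketch below queries the Stanford CoreNLP sentiment annotator through the CoreNLP HTTP server. It assumes a server is already running locally on port 9000; the exact spelling of the returned sentiment labels (e.g. "Verynegative" vs. "Very negative") may vary across CoreNLP versions.

```python
import json
import requests

# Assumes a local server, e.g.:
#   java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
CORENLP_URL = "http://localhost:9000"

def tweet_sentiment(text: str) -> list:
    """Classify each sentence on the 5-point scale used in the study."""
    props = {"annotators": "sentiment", "outputFormat": "json"}
    resp = requests.post(CORENLP_URL,
                         params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    return [(s["sentiment"], s["sentimentValue"])
            for s in resp.json()["sentences"]]

# Tweets whose sentences fall at either extreme of the scale would be
# candidates for the polarity-based selection criterion.
```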

Finally, we collected 100 additional tweets at random to include emotionally neutral or ambiguous tweets while correcting for bias towards certain emotions based on the choice of idioms, emoticons and hashtags. Table 4 summarizes the corpus selection criteria and distribution of the corresponding tweets selected for inclusion in the corpus.

Table 4 Corpus selection criteria and distribution

3.2 Manual Annotation of Emotional Content

Crowdsourcing has become a popular method of quickly obtaining large training datasets to support supervised machine learning approaches for a variety of text mining applications including sentiment analysis, e.g. (Purver and Battersby 2012; Taboada et al. 2011). Web platforms such as CrowdFlower (CrowdFlower 2016) or Mechanical Turk (Amazon 2016) allow users to set up and distribute crowdsourcing jobs to millions of online contributors.

We used CrowdFlower to annotate text documents with respect to their emotional content. A bespoke annotation interface was designed, consisting of three parts: the input text, an annotation menu based on a classification scheme and, where appropriate, a graphical representation of the classification scheme to serve as a visual aid (see Fig. 5 for an example). To mitigate the complexity of WordNet–Affect, we implemented autocomplete functionality, where matching items from the lexicon were automatically suggested as the annotator typed into a free text field. We introduced a neutral category into all classification schemes to allow for the annotation of a flat or absent emotional response. For example, "Fixing my iTunes library." was annotated as neutral by 23 of 30 annotators. Similarly, we introduced an ambiguous category to allow for the annotation of cases where an emotion is present, but indeterminate in the absence of context, intonation or body language. For example, the use of punctuation in "What a day!!!!!" clearly indicates an emotional charge, but it is unclear whether the statement is positive or negative.

Fig. 5 An annotation example
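
The autocomplete aid can be approximated with simple prefix matching over a sorted lexicon, as in the sketch below; the five-word lexicon is a hypothetical abridgement of WordNet–Affect, and the code is illustrative rather than the interface actually deployed.

```python
import bisect

class EmotionAutocomplete:
    """Prefix-based suggestions over a sorted emotion lexicon."""

    def __init__(self, lexicon):
        self.words = sorted(w.lower() for w in lexicon)

    def suggest(self, prefix: str, limit: int = 10) -> list:
        prefix = prefix.lower()
        i = bisect.bisect_left(self.words, prefix)
        out = []
        while (i < len(self.words) and self.words[i].startswith(prefix)
               and len(out) < limit):
            out.append(self.words[i])
            i += 1
        return out

ac = EmotionAutocomplete(["sad", "sadness", "satisfaction", "joy", "jubilant"])
print(ac.suggest("sa"))   # ['sad', 'sadness', 'satisfaction']
```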

In addition to the schemes discussed in Section 2, we also included free text classification, where the choice of annotations was unrestricted. We specifically wanted to investigate whether a folksonomy naturally emerging from annotators' free text choices could give rise to a suitable emotion classification scheme.

We set up six annotation jobs, one for each classification scheme, and asked contributors to annotate each text document with the single class that best described its emotional content. A total of 189 annotators participated in the study. Given a classification scheme, each document was annotated by five independent annotators. In total, 15,000 annotations (500 documents × 6 schemes × 5 annotations) were collected. The distributions of annotations across the schemes are shown in Fig. 6, with the WordNet–Affect and free text charts displaying the distributions of the top 20 most frequently used annotations.

Fig. 6 Distribution of annotations across each scheme

4 Utility Analysis: a Human Perspective

4.1 Quantitative Analysis of Annotation Results

4.1.1 Inter-Annotator Agreement

The main goal of this study was to identify an appropriate emotion classification scheme in terms of completeness and complexity, thereby minimizing the difficulty in selecting the most appropriate class for an arbitrary text example. We hypothesize that when a correct class is available, unambiguous and readily identifiable, then the likelihood of independent annotators selecting that particular class increases, thus leading to higher inter-annotator agreement (IAA).

We used Krippendorff's alpha coefficient (Krippendorff 2013) to measure the IAA. A generalization of several known reliability indices, it was chosen because it: (1) applies to any number of annotators, not just two, (2) applies to any number of classes, and (3) corrects for chance agreement. A Krippendorff's alpha coefficient of 1 indicates perfect agreement, whereas 0 indicates chance agreement; higher values therefore indicate better agreement. We calculated Krippendorff's alpha coefficient values using an online tool (Geertzen 2016). The results are shown in Fig. 7, which also includes values of the adjusted Rand index (Hubert and Arabie 1985; Steinley 2004; Steinley et al. 2016) as an alternative measure of agreement. A Krippendorff's alpha coefficient of α = 0.667 has been suggested as the threshold for trustworthy data reliability (Krippendorff 2004). With the highest value at 0.483, the IAA results in this study are well below this threshold, which is consistent with other studies on affective annotation (Devillers et al. 2005; Callejas and López-Cózar 2008; Antoine et al. 2014).

Fig. 7 Inter-annotator agreement results
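
The study itself used an online tool, but the same coefficient can be computed with, for example, the open-source krippendorff Python package, as sketched below on a toy reliability matrix; this is an assumption about tooling, not the calculation performed in the study.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = documents; emotion labels are coded as
# integers and np.nan would mark missing annotations. The nominal level
# of measurement fits categorical emotion classes.
reliability_data = np.array([
    [0, 1, 2, 0, 1],
    [0, 1, 2, 0, 2],
    [0, 1, 1, 0, 1],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```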

Here we discuss potential reasons for low IAA. Annotation is a highly subjective process that varies with age, gender, experience, cultural location and individual psychological differences (Passonneau et al. 2008). Additionally, a text document may consist of multiple statements, which may convey different or competing emotional content. For example, there are two statements in the following sentence: "On train going skating :) Hate the rain :(," each associated with a different emotion, as clearly illustrated by the use of emoticons. Using the wheel of emotion, this sentence received the following five annotations: sadness, sadness, joy, love, ambiguous. It can be inferred that annotators 1 and 2 focused on the latter statement, whereas annotators 3 and 4 focused on the former. Annotator 5 acknowledged the presence of both positive and negative emotions by classifying the overall text as ambiguous. A genuine ambiguity occurs when the underlying emotion may be interpreted differently in different contexts (e.g. "Another week off," received two ambiguous and three joy annotations), which leads to inter-annotator disagreement. Other factors, such as annotators' skills and focus, the clarity of the annotation guidelines and the inherent ambiguity of natural language, may have also contributed to low IAA. These factors may explain low IAA, but fail to explain the large variation in agreement across the annotation schemes, which ranged from 0.202 to 0.483 with a standard deviation of 0.112. Nonetheless, these results enabled a comparison of different schemes.

Unsurprisingly, given the smallest number of options, the highest IAA (α = 0.483) and the highest number of unanimous agreements (175 out of 500, i.e. 35%) were recorded for six basic emotions. An important factor to consider here is that this scheme incurred by far the highest usage of neutral and ambiguous annotations – 576 out of 2500 (23%). This may imply that the scheme of six basic emotions has insufficient coverage of the emotion space.

Intuitively, one may expect IAA to be higher for schemes with fewer classes, as seen in some empirical studies (Antoine et al. 2014), because fewer choices offer fewer chances for disagreement. However, Krippendorff's alpha coefficient is a chance-corrected measure of IAA, which suggests this may not necessarily be the case. Specifically, our study shows higher agreement for a scheme with 18 categories (the wheel of emotion) than it does for schemes of 10 or 12 classes (EARL and Circumplex). With α = 0.410, the wheel of emotion recorded the second highest IAA. In comparison to six basic emotions, annotators resorted less frequently to using neutral and ambiguous annotations (see Fig. 7). It also recorded the second highest number of unanimous agreements (119 out of 500, i.e. 24%).

The dimensional schemes, Circumplex and EARL, both with a similar number of classes (12 and 10 respectively), recorded similar levels of IAA (α = 0.312 and α = 0.286 respectively). However, an important difference between these schemes was the usage of neutral and ambiguous annotations. Circumplex incurred the second highest usage of these annotations. On the other hand, EARL had the second lowest usage of these annotations, after free text. This implies that, with 10 generic categories, EARL provides better coverage of the emotion space.

Due to the ambiguity and polysemy of natural language, the two lexical schemes, WordNet–Affect and free text, recorded the lowest IAA (α = 0.202 and α = 0.205 respectively) and incurred the fewest unanimous agreements (16 and 22 out of 500, i.e. 3% and 4% respectively). The lower IAA for WordNet–Affect may be explained by the difficulty of navigating a large hierarchy. With 262 and 260 different annotations recorded, WordNet–Affect and free text covered a wide range of emotive expressions, which provided annotators with the means of referring to a specific emotion when a suitable generic category was not available in other schemes, thus minimizing the use of ambiguous annotations.

To determine the significance of the differences in IAA across the schemes, we constructed confidence intervals for the given values. Given an unknown distribution of the Krippendorff's alpha coefficient, the best way to construct confidence intervals is to use the bootstrap (Efron and Tibshirani 1994). We used 1,000 replicate re-samples from the 500 tweets. Specifically, we randomly selected instances from the original set of 500 tweets to be included in a sample. The sampling was performed with replacement and, therefore, when a single tweet was included multiple times in the same sample, we re-used the same annotations. We then used the percentile method (Davison and Hinkley 1997) to construct 95% confidence intervals by cutting 2.5% of the replicates from each end. The confidence intervals were as follows: six basic emotions (0.4498, 0.5146), the wheel of emotion (0.3809, 0.4372), Circumplex (0.2871, 0.3348), EARL (0.2602, 0.3073), WordNet–Affect (0.1826, 0.2196) and free text (0.1842, 0.2250). Where there is no overlap between the confidence intervals (see Fig. 8), we can assume a statistically significant difference in IAA between the two schemes. Therefore, six basic emotions have significantly higher IAA than all other schemes, and the wheel of emotion has significantly higher agreement than Circumplex, EARL, WordNet–Affect and free text. Circumplex and EARL have similar IAA, both significantly higher than WordNet–Affect and free text. The last two schemes demonstrated similar IAA.

Fig. 8 Confidence intervals for the inter-annotator agreement
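
The percentile bootstrap described above follows a generic recipe, sketched below; `statistic` stands for a function computing Krippendorff's alpha over a resampled set of annotated tweets (a hypothetical helper, not shown).

```python
import numpy as np

def bootstrap_ci(items, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap: resample items with replacement (re-using
    their annotations) and cut (1 - level)/2 of the replicates from
    each tail of the replicate distribution."""
    rng = np.random.default_rng(seed)
    n = len(items)
    reps = []
    for _ in range(n_boot):
        sample = [items[i] for i in rng.integers(0, n, size=n)]
        reps.append(statistic(sample))
    lo_pct = (1 - level) / 2 * 100
    return tuple(np.percentile(reps, [lo_pct, 100 - lo_pct]))

# Usage sketch: items would be the 500 annotated tweets and `statistic`
# a function returning Krippendorff's alpha for a resampled corpus.
```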

4.1.2 Establishing the Ground Truth

Emotive language analysis tasks such as subjectivity or sentiment classification can be automated using machine learning, lexicon-based or hybrid approaches (Ravi and Ravi 2015). The accuracy of such methods is then tested against the ground truth. In addition, supervised learning approaches require the ground truth for training purposes. When the ground truth is not readily available, human experts are asked to annotate the data. The most frequent annotation per data item is then commonly accepted as the ground truth, with the expectation that the automated system behaves as the majority of human annotators. In this study, we followed the same approach. For each classification scheme, an annotation agreed upon by a relative majority of at least 50% was assumed to be the ground truth (see Fig. 9 for the distribution of ground truth annotations). For example, using six basic emotions as the classification scheme, the sentence "For crying out loud be quiet" was annotated four times with anger and once with disgust, thus anger was accepted as the ground truth.

Fig. 9 Distribution of ground truth annotations
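
A minimal sketch of this majority-vote rule, assuming five labels per document:

```python
from collections import Counter

def ground_truth(labels, min_share=0.5):
    """Accept the most frequent annotation as ground truth if it reaches
    a relative majority of at least `min_share`; otherwise return None,
    flagging the document for an independent resolver."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count / len(labels) >= min_share else None

print(ground_truth(["anger"] * 4 + ["disgust"]))                   # 'anger'
print(ground_truth(["joy", "love", "awe", "sadness", "sadness"]))  # None
```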

When no majority annotation could be identified, a new independent annotator resolved the disagreement. Table 5 (across the diagonal) provides the percentage of text instances that required disagreement resolution under each scheme. The remaining values illustrate the overlap of such text instances across the schemes. Overall, 18 instances (i.e. 3.6%) required disagreement resolution under all schemes. Instances that required disagreement resolution under many schemes are likely to be genuinely ambiguous; otherwise, the ambiguity is likely to be related to a given annotation scheme. In that sense, we wanted to investigate the relationships between the schemes. We performed multidimensional scaling over the data given in Table 5. Its results suggested that two directions account for 88% of the variation, so we used them to visualize the similarity of classification schemes in terms of underlying ambiguities (see Fig. 10). The first direction (along the x-axis) separates WordNet–Affect and free text from the remaining schemes. The second direction (along the y-axis) separates EARL and the wheel of emotion from the other four schemes. Both directions show the similarity between six basic emotions and Circumplex, as well as the similarity between EARL and the wheel of emotion. As expected, WordNet–Affect and free text are far away from the rest and from each other, indicating a much higher degree of inter-annotator disagreement.

Table 5 The percentage of instances that required disagreement resolution
Fig. 10 Multidimensional scaling results
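
The projection can be reproduced with any standard multidimensional scaling routine; the sketch below uses scikit-learn on a hypothetical symmetric dissimilarity matrix standing in for (a transform of) Table 5, since the actual values are not reproduced here.

```python
import numpy as np
from sklearn.manifold import MDS

schemes = ["basic6", "wheel", "circumplex", "EARL", "WN-Affect", "free text"]

# Hypothetical dissimilarities between schemes (symmetric, zero diagonal).
D = np.array([
    [0.0, 0.3, 0.2, 0.4, 0.7, 0.7],
    [0.3, 0.0, 0.4, 0.2, 0.7, 0.7],
    [0.2, 0.4, 0.0, 0.4, 0.7, 0.7],
    [0.4, 0.2, 0.4, 0.0, 0.7, 0.7],
    [0.7, 0.7, 0.7, 0.7, 0.0, 0.8],
    [0.7, 0.7, 0.7, 0.7, 0.8, 0.0],
])

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
for name, (x, y) in zip(schemes, coords):
    print(f"{name:>10}: ({x:+.2f}, {y:+.2f})")
```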

4.1.3 Correspondence Analysis

To illustrate the difference in the coverage of different schemes, let us consider annotations of the sentence "I'll always have a soft spot in my heart for this girl," (see Table 6). Despite the unanimous agreement under six basic emotions, it is still difficult to interpret the given sentence as an expression of happiness. Where love or related emotions are available, we can see a strong preference towards choosing such emotions (wheel of emotion, EARL, WordNet–Affect and free text). This point is reinforced in the case of Circumplex, which similarly lacks a category related to love.

Table 6 Examples of annotation preferences

To generalize these observations and to explore the relationships between the classes across different schemes, we conducted correspondence analysis (Hirschfeld 1935), a dimension reduction method appropriate for categorical data. It is used for graphical representation of the relationships between two sets of categories. The large number of classes in WordNet–Affect and free text classification makes graphical representation of correspondence analysis involving either of these schemes highly convoluted. We, therefore, only present the results involving the four remaining schemes. For the analysis we used the ground truth annotations (see Section 4.1.2 for more details) and compared them between two schemes at a time. Figures 11, 12, 13, 14, 15, and 16 show the first two dimensions in correspondence analysis between the schemes.
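
Correspondence analysis itself reduces to a singular value decomposition of the standardized residuals of a contingency table. The sketch below is a compact implementation from first principles; the table is a hypothetical toy example, not the study's data.

```python
import numpy as np

def correspondence_analysis(N):
    """Classical CA: SVD of the standardized residuals of a contingency
    table N; returns 2-D row and column principal coordinates."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    Dr = np.diag(1.0 / np.sqrt(r))          # D_r^{-1/2}
    Dc = np.diag(1.0 / np.sqrt(c))          # D_c^{-1/2}
    S = Dr @ (P - np.outer(r, c)) @ Dc      # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (Dr @ U) * s                     # row principal coordinates
    cols = (Dc @ Vt.T) * s                  # column principal coordinates
    return rows[:, :2], cols[:, :2]

# Toy contingency table: tweets labelled with each basic emotion (rows)
# versus each class of a second scheme (columns).
N = np.array([[40.0, 5.0, 1.0],
              [3.0, 25.0, 2.0],
              [1.0, 4.0, 30.0]])
rows2d, cols2d = correspondence_analysis(N)
```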

Fig. 11 Six basic emotions (blue) versus the wheel of emotion (red)

Fig. 12 Six basic emotions (blue) versus Circumplex (red)

Fig. 13 Six basic emotions (blue) versus EARL (red)

Fig. 14 The wheel of emotion (blue) versus Circumplex (red)

Fig. 15 The wheel of emotion (blue) versus EARL (red)

Fig. 16 Circumplex (blue) versus EARL (red)

From the correspondence analysis between six basic emotions and Plutchik's wheel (Fig. 11), we can see that the first dimension separates positive emotions (e.g. happiness, love) on the left from the negative ones (e.g. anger, sadness) on the right. The second dimension appears to differentiate between aggressive emotions (e.g. anger, aggressiveness) and more passive ones (e.g. sadness, fear). If we further study the distribution of emotions across the two dimensions, we can see that four emotions from Plutchik's wheel, namely submission, joy, love and awe, correspond to a single basic emotion – happiness. Emotions that exist in both schemes are located close together in the graph, e.g. anger in the basic emotions scheme is close to anger in the wheel of emotion, and the same applies to surprise, fear, sadness and disgust. On the other hand, it seems that emotions like remorse, anticipation, optimism, disapproval, trust and aggressiveness, which exist only in the wheel of emotion, do not correspond closely to a specific basic emotion. This suggests that these emotions are not redundant, i.e. they cannot easily be abstracted into a basic emotion. Moreover, further analysis of the annotations across the two schemes shows that the annotators often resorted to happiness, the only positive basic emotion, as a surrogate for a diverse range of emotions found in Plutchik's wheel, including awe, submission and love, which do not necessarily imply happiness.

From the correspondence analysis between six basic emotions and Circumplex (Fig. 12), we can see that two positive classes in Circumplex, high positive affect and pleasantness, correspond to the basic emotion of happiness. On the other hand, some classes from Circumplex, e.g. strong engagement, low positive affect and low negative affect are not particularly close to any basic emotion. Similarly, the correspondence analysis between six basic emotions and EARL (Fig. 13), shows that all positive classes correspond to happiness, negative thoughts correspond to fear, negative and forceful corresponds to anger, whereas both negative and passive and negative and not in control correspond to sadness.

Figures 14 and 15 show how emotions from Plutchik's wheel relate to classes from Circumplex and EARL respectively. It is clear that even though Circumplex and EARL are richer than six basic emotions, they still do not seem to model emotions from Plutchik's wheel completely and unambiguously. For example, we can see from Fig. 14 that Circumplex does not have a class that corresponds to a number of emotions in Plutchik's wheel, e.g. optimism, trust, anticipation, remorse and disapproval. Similarly, from Fig. 15 we can see that classes from EARL do not align well against love, surprise, anticipation and aggressiveness. Finally, with few exceptions, Fig. 16 illustrates a clear alignment between classes in Circumplex and EARL, suggesting that they cover and partition the semantic space of emotions in a similar way.

To conclude, both Circumplex and EARL provide generic classes, which align fairly well (see Fig. 16). Not surprisingly, they achieved similar IAA, with no statistically significant difference between them (see Fig. 8). When compared with the schemes that use specific emotions rather than generic classes, i.e. six basic emotions and the wheel of emotion, neither of the generic schemes seems to model surprise well (see Figs. 12, 13, 14, and 15). In addition, even though EARL explicitly lists love as an example of the class caring, comparison with the wheel of emotion shows no strong correspondence between the two (see Fig. 15). The best results were seen with schemes that use specific emotions, with six basic emotions demonstrating significantly better IAA than the wheel of emotion. However, six basic emotions require a wider range of positive emotions. Figures 11 and 13 indicate that happiness is consistently used as a surrogate for love. We, therefore, suggest expanding six basic emotions with love when using this scheme for emotive language analysis.

4.1.4 Taxonomy Versus Folksonomy

The use of WordNet–Affect and free text gave rise to a relatively large number of distinct annotations, which made their use in correspondence analysis impractical. However, we still wanted to explore the difference in the lexical expression of emotion depending on whether the choice of labels was restricted or not. In effect, we can view WordNet–Affect as a taxonomy whose vocabulary is used to ensure consistent annotation. Its hierarchical structure also allows us to compare these annotations in terms of their semantic similarity. However, taxonomies, which are typically defined by domain experts, do not necessarily reflect user vocabulary (Kiu and Tsui 2011). The lack of appropriate taxonomies and the rapidly increasing volume of user-generated information on the Web have given rise to folksonomies. Annotation choices, which are not restricted to a predefined vocabulary, allow folksonomies to emerge in a bottom-up manner (Laniado et al. 2007). However, the much-needed flexibility and freedom for users to annotate information according to their own preferences come at a cost: folksonomies may be inferior to taxonomies in their ability to support search and browse functions, mainly because of their flat structure. Therefore, much effort has been put into organizing folksonomies hierarchically (Laniado et al. 2007) or hybridizing them with taxonomies (Kiu and Tsui 2011).

In this study, we included free text annotation to investigate whether a folksonomy naturally emerging from annotators' free text choices could give rise to a suitable emotion classification scheme. To impose a structure on this folksonomy, we aligned it against WordNet–Affect. The two sets of annotations overlapped on a total of 2,107 (42%) individual annotations, which corresponded to 74 distinct words and 63 tree nodes in WordNet–Affect. We extracted the corresponding subtree from WordNet–Affect and used the frequency with which individual nodes were used to further prune the tree by merging rarely used nodes with their closest ancestor. We then analyzed 247 (5%) free text annotations not found in WordNet–Affect, which corresponded to 127 distinct words. Most of these words were used five times or fewer and were excluded from further consideration. We analyzed the remaining five words and tried to map them onto the previously extracted hierarchy. Two frequently used annotations, humor and funny, were mapped onto an existing node – amusement. The other three frequently used annotations, ill, tired and exhausted, were merged into a single concept – fatigue, which was then added to the hierarchy. As a result, we organized the folksonomy into a hierarchy of 12 positive and 14 negative emotions (see Figs. 17 and 18). This reduced the original WordNet–Affect hierarchy from a total of 278 nodes and 11 levels to a manageable hierarchy of 27 nodes and 5 levels.
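
The pruning step, merging rarely used nodes into their closest retained ancestor, can be sketched as follows on a toy parent-map hierarchy; the threshold and node names are illustrative, not the study's.

```python
def prune(parent, counts, threshold=5):
    """Map every node to its closest ancestor whose usage frequency
    exceeds `threshold`; rarely used nodes are thereby merged upwards.
    `parent` maps node -> parent; the root is named 'root'."""
    kept = {n for n in parent if counts.get(n, 0) > threshold}
    kept.add("root")
    merged = {}
    for node in parent:
        anc = node
        while anc not in kept:      # climb until a kept ancestor
            anc = parent[anc]
        merged[node] = anc
    return merged

parent = {"joy": "root", "ecstasy": "joy", "sadness": "root"}
counts = {"joy": 40, "ecstasy": 2, "sadness": 25}
print(prune(parent, counts))
# {'joy': 'joy', 'ecstasy': 'joy', 'sadness': 'sadness'}
```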

Fig. 17 A folksonomy of positive emotions

Fig. 18 A folksonomy of negative emotions

4.2 Qualitative Analysis of Annotators' Perceptions

In order to gain a qualitative insight into how human annotators interpret and use the schemes, we conducted semi-structured interviews with 6 participants who had an academic background in social sciences. The annotation guidelines were explained to participants. Each participant was given a different sample of five text documents to annotate. They annotated the sample six times, once for each classification scheme. The order in which schemes were used for annotation was randomized for each participant. Their experiences were then discussed in a semi-structured interview. Table 7 provides the semi-structured interview guide. The interviews were recorded and transcribed verbatim. We conducted thematic analysis of the transcripts. The extracted themes (see Table 8) were related to annotators (subjectivity and certainty), data (context, ambiguity and multiplicity) and the schemes (coverage and complexity).

Table 7 Semi-structured interview guide
Table 8 The summary of thematic analysis

Generally, participants found the annotation task difficult, often not feeling confident about their choice. Annotators agreed that features such as punctuation (e.g. !) and words with strong sentiment polarity (e.g. beautiful, amazing, disgusting and horrible) were strong indicators of an emotion. The annotation choices for utterances that conveyed multiple emotions varied greatly across the annotators. For example, "My dress is so cute ugh. Praying no one wears the same one or else I will go ballistic," was interpreted to express both a positive and a negative emotion.

When context was absent (e.g. "Please stay away"), participants required more time to find an appropriate annotation. Annotators found themselves reading the text with different intonation in order to re-contextualize the underlying emotion. Upon failing to identify the context, annotators doubted their original annotation choice, claiming they may have over-compensated for the lack of context.

One significant factor affecting annotation was the number of classes available in a scheme. In particular, for six basic emotions, annotators found that the classes were meaningful or relevant for distinguishing the polarity of the text, but not the types of emotion expressed. This is consistent with the results of the correspondence analysis described in Section 4.1, which identified happiness as a poor surrogate for love. The insufficient coverage of the emotion space in this scheme significantly restricted the choices, resulting in poor capture of the primary emotion conveyed. It became unsatisfying for participants to annotate with a class that was not fit for purpose, i.e. classes did not map easily onto the emotional content. For example, "So proud of myself right now :D," was annotated as happiness, but also conveys pride, an emotion that is distinct from happiness (Sullivan 2007).

When faced with the wheel of emotion, the annotators found it difficult to choose between related emotions. For example, annotators debated whether "I'll always think the world of you," expressed love, trust or awe. In this case, annotators agreed that having multiple options in terms of intensity or similarity would be more appropriate. The structure of the wheel received a negative response. Annotators agreed that it contained too much information and was quite complex to understand without additional explanation. Some annotators argued that certain emotions in the wheel (e.g. trust) are not necessarily emotions but states, and questioned some emotion combinations (e.g. anticipation + joy = optimism, sadness + surprise = disapproval). Annotators felt that the wheel of emotion, in comparison to six basic emotions, provides better coverage, but still lacks the ability to encode some emotions. For example, annotators required an emotion to represent discomfort for "My throat is killing me," but annotated it with disapproval, sadness and neutral instead.

Annotators felt the categories in both EARL and Circumplex were not distinct. An overlap between some classes (e.g. negative & not in control and negative thoughts) was named as one of the reasons for annotators' disagreement. Although conceptually similar, the dimensional structure of Circumplex and its choice of emotions caused more resistance among annotators, as they misinterpreted the mapping of emotions onto their categories (e.g. they disagreed that dull, sleepy and drowsy were positive affects). For both EARL and Circumplex, very little attention was paid to the categories themselves. Annotators were in favor of the examples of emotions in each category, and felt that "once they had distinguished" the nuance of emotion being represented, they "had a general feeling as to which category it belonged to". Annotators appreciated having similar emotions clustered together into a generic category. This provided them with useful cues when classifying the general mood of the text, which may be easier than choosing a specific emotion. However, they acknowledged that some information would be lost when annotating with generic categories.

When faced with WordNet–Affect, annotators were able to freely decide on a specific emotion in the hierarchy (e.g. "I don't feel well ugh," received sick, miserable, unhappy and fed-up annotations). Yet, annotators felt "restricted" as some of the emotions that they had interpreted in the text were not available (e.g. "I was certain it was relief, but it only had relieve, and they are not the same thing..."). The autocomplete functionality proved insufficient, as annotators continually searched for emotions that were not present in the lexicon. This increased the time spent in completing the annotation task. Emotional content was often described in complex terms (e.g. "I chose aggravated, because it's stronger than annoyance."), or could not be pinpointed (e.g. "You know what it means and you feel it, but you can't find the right word to describe it."). In these situations, annotators would search for the synonyms of their original choices, or search for the basic form of the emotion, until an appropriate substitute was found (e.g. "I wanted exhausted, but had to settle for tired..."). This may imply that WordNet–Affect is somewhat incomplete. A recommendation for improving the navigation of this scheme is to have a drop down menu of similar emotions in addition to the autocomplete functionality.

The free text classification scheme diminished annotators' confidence in choosing an appropriate annotation (e.g. "There is too much choice now. I think of a word and doubt. Is this what I really mean? Because there is no guideline I doubt. When there is a group, I think 'it definitely fits here'."). Annotators described this scheme as "resembling what we do in everyday life when we read a piece of text. We read something and we feel it." However, when asked to describe an emotion using a particular word, annotators often could not articulate it (e.g. "I had multiple emotions, but could not find a word to describe them all."). Free text annotations accrued a range of lexical representations of emotions with similar valences (e.g. "My girlfriend disapproves of me :(," received self-disgust, shame and disapproval annotations) and intensities (e.g. "Don't want to see the hearse coming down my road today. RIP Anna," received trepidation, dread, sadness and upset annotations). For both lexical schemes, annotators acknowledged that "regardless of the terms we use, we are all in agreement of the general feeling expressed" in the text.

5 Utility Analysis: A Machine Perspective

Classification performance can be negatively affected by class imbalance and the degree of overlap among the classes (Prati et al. 2004). To explore how well text classification algorithms can learn to differentiate between the classes within a given scheme, we evaluated the performance of supervised machine learning when the corresponding annotations were used to train the classification model. The ground truth annotations (see Section 4.1.2) were used to create a set of gold standards (one for each classification scheme). We then used the gold standard data with Weka (Hall et al. 2009), a popular suite of machine learning software, to perform 10-fold cross-validation experiments. All text documents from the corpus of 500 tweets were converted into feature vectors using a bag-of-words representation. We tested a wide range of supervised learning methods included in Weka. Support vector machines consistently outperformed other methods; we therefore report the results achieved by this method. Classification performance was measured using precision (P), recall (R) and F-measure (F), and the results are given in Fig. 19. One may try to assess significance using bootstrap confidence intervals of the cross-validation results, but it has been reported that such practice may lead to biased estimates and, therefore, should be avoided (Vanwinckelen and Blockeel 2012).

Fig. 19 The results of cross-validation experiments
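
The study used Weka; an analogous bag-of-words pipeline can be sketched in scikit-learn as below. The toy corpus stands in for the 500 tweets and their gold-standard labels under one scheme; this is an illustration of the experimental setup, not the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; `texts` would hold the 500 tweets and `labels`
# the gold-standard annotations for one classification scheme.
texts = ["For crying out loud be quiet", "So proud of myself right now :D",
         "Fixing my iTunes library.", "I don't feel well ugh"] * 25
labels = ["anger", "happiness", "neutral", "sadness"] * 25

model = make_pipeline(CountVectorizer(), LinearSVC())   # bag-of-words + SVM
pred = cross_val_predict(model, texts, labels, cv=10)   # 10-fold CV
print(classification_report(labels, pred))              # P, R, F per class
```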

The ranking of the classification schemes with respect to F-measure is similar to the ranking with respect to the IAA, with the exception of the wheel of emotion and Circumplex, which swapped places. F-measure ranged from 15.9% to 41.0% with a standard deviation of 9.5 percentage points. Notably, there was less variation in classification performance across the schemes than in IAA. Intuitively, the classification results are expected to be inversely proportional to the number of classes in the scheme. Unsurprisingly, given the smallest number of options, the highest F-measure (F = 41.0%) was recorded for six basic emotions. However, EARL (10 classes) is ranked behind the wheel of emotion (16 classes). WordNet–Affect and free text demonstrated almost identical F-measure, at the lower end of the spectrum.

To get better insight into the classification performance across the schemes, we analyzed the confusion matrices, which show how the automatically predicted classes compare against the actual ones from the gold standard. For each scheme, confusion often occurred between opposite emotions, happiness and sadness (see Table 9). These confusions may be explained by the limitations of the bag-of-words approach, which ignores the text structure, thereby disregarding compositional semantics. Specifically, negation, which can reverse the sentiment of a text expression, was found to contribute to confusion. For example, "Why do you not love me? Why? :(," was automatically classified as pleasantness, caring or happiness, whereas it was annotated as unpleasantness, negative & passive and depression in the gold standard. Such predictions were largely based on the use of the word love, which represents a text feature highly correlated with the positive classes in the training set. Out of 14 mentions of the word love, four were used in a negative context. All three negated mentions were found within the negative examples; the remaining negative mention of the word love was sarcastic. This example illustrates the need to include negation as a salient feature.

Table 9 Misclassification of opposite emotions
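
One simple way to expose negation to a bag-of-words model is NLTK's mark_negation utility, which suffixes tokens in a negated scope; the sketch below shows its effect on the example above. This is one possible remedy, not one evaluated in the study.

```python
from nltk.sentiment.util import mark_negation

tokens = "Why do you not love me ? Why ? :(".split()
print(mark_negation(tokens))
# Tokens between the negation and the next punctuation are suffixed,
# e.g. ['Why', 'do', 'you', 'not', 'love_NEG', 'me_NEG', '?', 'Why', '?', ':('],
# so 'love_NEG' becomes a feature distinct from 'love'.
```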

The second largest consistently occurring confusion was related to the neutral category (see Table 10), which, in the absence of discriminative features, was typically misclassified as one of the two largest classes in the gold standard, i.e. either happiness or sadness (see Fig. 9). Another trend noticed across all schemes was the misclassification of active negative emotions, anger, annoyance or disgust, as sadness, and slightly less often the other way around (see Table 11). Again, because this behavior is recorded consistently across all schemes, the phenomenon may be explained by the limitations of the bag-of-words approach. Further investigation is needed to determine whether a richer feature set (e.g. additional syntactic features to differentiate between active and passive voice) would help to better discriminate between these classes.

Table 10 Misclassification of the neutral category
Table 11 Misclassification of active and passive negative emotions

Whereas the classification confusions discussed above were common across all schemes, it was notable that both dimensional schemes, Circumplex and EARL, demonstrated relatively more confusion across a wider range of classes (see Tables 12 and 13). This suggests that their generic categories may not be sufficiently distinctive, and, therefore, are not the best suited for emotive language analysis.

Table 12 Confusion matrix for the classification predictions against Circumplex classes
Table 13 Confusion matrix for the classification predictions against EARL classes

6 Conclusion

We considered six emotion classification schemes (six basic emotions, the wheel of emotion, Circumplex, EARL, WordNet–Affect and a free text classification scheme) and investigated their utility for emotive language analysis. We first studied their use by human annotators and subsequently analyzed the performance of supervised machine learning when their annotations were used for training. For both purposes, we assembled a corpus of 500 emotionally charged text documents. The corpus was annotated manually using an online crowdsourcing platform with five independent annotators per document. Assuming that classification schemes with a better balance between completeness and complexity are easier to interpret and use, we expected such schemes to be associated with higher IAA. We used Krippendorff's alpha coefficient to measure IAA, according to which the six classification schemes were ranked as follows: (1) six basic emotions (α = 0.483), (2) wheel of emotion (α = 0.410), (3) Circumplex (α = 0.312), (4) EARL (α = 0.286), (5) free text (α = 0.205), and (6) WordNet–Affect (α = 0.202). Six basic emotions were found to have a significantly higher IAA than all other schemes. However, correspondence analysis of annotations across the schemes highlighted that basic emotions are oversimplified representations of complex phenomena and as such are likely to lead to invalid interpretations, which are not necessarily reflected by high IAA. Specifically, the basic emotion of happiness was mapped to classes distinct from happiness in other schemes, namely submission, love and awe in Plutchik's wheel, high positive affect (e.g. enthusiastic, excited, etc.) in Circumplex and all positive classes in EARL including caring (e.g. love, affection, etc.), positive thoughts (e.g. hope, pride, etc.), quiet positive (e.g. relaxed, calm, etc.) and reactive politeness (e.g. interest, surprise, etc.). Semi-structured interviews with the annotators also highlighted this issue. The scheme of six basic emotions was perceived as having insufficient coverage of the emotion space, forcing annotators to resort to inferior alternatives, e.g. using happiness as a surrogate for love. Therefore, further investigation is needed into ways of better representing basic positive emotions by considering those naturally emerging from free text annotations: love, hope, admiration, gratitude and relief.

In the second part of the study, we wanted to explore how well text classification algorithms can learn to differentiate between the classes within a scheme. Classification performance can be negatively affected by class imbalance and the degree of overlap among the classes. In terms of feature selection, poorly defined classes may not be linked to sufficiently discriminative text features that would allow them to be identified automatically. To measure the utility of different schemes in this sense, we created six training datasets, one for each scheme, and used them in cross-validation experiments to evaluate classification performance in relation to different schemes. According to the F-measure, the classification schemes were ranked as follows: (1) six basic emotions (F = 0.410), (2) Circumplex (F = 0.341), (3) wheel of emotion (F = 0.293), (4) EARL (F = 0.254), (5) free text (F = 0.159) and (6) WordNet–Affect (F = 0.158). Not surprisingly, the smallest scheme achieved a markedly higher F-measure than all other schemes. For each scheme, confusion often occurred between opposite emotions (or equivalent categories), happiness and sadness. These confusions may be explained by the limitations of the bag-of-words approach to document representation, which ignores the text structure, thereby disregarding compositional semantics. Specifically, negation, which can reverse the sentiment of a text expression, was found to contribute to confusion. Another trend noticed across all schemes was the misclassification of active and passive negative emotions (e.g. anger vs. sadness). Again, this phenomenon may be explained by the limitations of the bag-of-words approach. Further investigation is needed to determine whether a richer feature set (e.g. syntactic features) would help to better discriminate between related classes.

The classification confusions discussed above were commonly found across all schemes and, as suggested, represent the effects of a document representation choice rather than specific classification schemes. However, it was notable that both dimensional schemes, Circumplex and EARL, demonstrated higher confusion across a wider range of classes. This suggests that their categories may not be sufficiently distinctive, and, therefore, are not the best suited for emotive language analysis.

To conclude, six basic emotions emerged as the most useful classification scheme for emotive language analysis in terms of ease of use by human annotators and training supervised machine learning algorithms. Nonetheless, further investigation is needed into ways of extending basic emotions to encompass a variety of positive emotions, because happiness, as the only representative of positive emotions, is forcibly used as a surrogate for a wide variety of distinct positive emotions.