
DEMoS: an Italian emotional speech corpus

Elicitation methods, machine learning, and perception

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

We present DEMoS (Database of Elicited Mood in Speech), a new, large database of Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, we model the ‘big 6 emotions’ and guilt, for comparison with the state of the art. Besides making this database available for research, our contribution is three-fold: First, we employ a variety of mood induction procedures, whose combinations are especially tailored for specific emotions. Second, we use combinations of selection procedures such as an alexithymia test and self- and external assessment, obtaining 1.5 k (proto-)typical samples; these were used in a perception test (86 native Italian subjects, categorical identification and dimensional rating). Third, machine learning techniques—based on standardised brute-forced openSMILE ComParE features and support vector machine classifiers—were applied to assess how emotional typicality and sample size might impact machine learning efficiency. Our results are three-fold as well: First, we show that appropriate induction techniques ensure the collection of valid samples, whereas the type of self-assessment employed turned out not to be a meaningful measurement. Second, emotional typicality—which shows up in an acoustic analysis of the main prosodic features—is, in contrast to sample size, not an essential requirement for successfully training machine learning models. Third, the perceptual findings demonstrate that the confusion patterns mostly relate to cultural rules and to ambiguous emotions.
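
To make the modelling setup concrete, the following is a minimal sketch, not the authors’ exact pipeline, of how standardised ComParE functionals and a linear support vector machine (cf. LIBLINEAR, Fan et al. 2008) can be combined. It assumes the opensmile Python package and scikit-learn; all file names, labels, and the complexity value C are illustrative placeholders, not part of the released corpus.

    # Hedged sketch: ComParE 2016 functionals + standardisation + linear SVM.
    # Assumes `pip install opensmile scikit-learn`; the .wav paths below are
    # hypothetical placeholders, not files shipped with DEMoS.
    import opensmile
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # One 6373-dimensional functional vector per utterance.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    train_files = ["anger_01.wav", "sadness_01.wav"]   # hypothetical paths
    train_labels = ["anger", "sadness"]

    X_train = smile.process_files(train_files)         # one row per file
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1e-3))
    clf.fit(X_train, train_labels)

    X_test = smile.process_files(["unseen_01.wav"])    # hypothetical test file
    print(clf.predict(X_test))                         # e.g. ['anger']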


Notes

  1. The corpus is available upon request through a personalised download link.

  2. The consent agreement was designed by Santa Lucia Foundation (Research and Health Care Institute).

  3. http://peir.path.uab.edu/library/.

  4. To give an example: For anger produced by females, we would expect 460 samples (20 S-Chunks × 23 participants) based on the Empathy MIP and 46 samples (2 Sentences × 23 participants) based on the Self-statement MIP. Yet, we end up with 461 samples based on the Empathy MIP (cf. P-Chunks in Table 1) and 55 samples based on the Self-statement MIP (cf. Sentences in Table 1). P-Chunks can integrate across syntactic boundaries but, more often, they partition S-Chunks into smaller units.

  5. A ‘prototype’ is a central, natural category (Rosch 1973) with a unique representation, not composed of a combination of simpler ones.

  6. Null-hypothesis testing with p-values as the decisive criterion has been criticised repeatedly since its inception; we refer to the statement of the American Statistical Association in Wasserstein and Lazar (2016). Throughout this article, we will thus report p-values not as criteria for a binary yes-no decision ‘significant/not significant’ but rather as a descriptive device; note that we do not correct for repeated measurements.

  7. The down-sampling in Test (cf. row 2 for F and row 6 for M in Table 12) was done to allow a comparison with the down-sampling in Train and Dev with everything else kept equal (cf. row 4 for F and row 8 for M in Table 12), by processing fully balanced groups in both cases; as expected, the classification results for the down-sampled Test set did not differ noticeably from those obtained for the unbalanced group. A minimal down-sampling sketch follows after these notes.

  8. Note that we compare F0 values only within, not across, gender; thus, we do not have to take into account the different mean pitch ranges of males and females.

  9. https://cycling74.com/products/max/.
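
Regarding Note 7, here is a minimal down-sampling sketch under the assumption of a simple random strategy (the exact procedure behind Table 12 is not specified here): every emotion class is reduced to the size of the smallest one, which yields the fully balanced groups mentioned above. Function and variable names are illustrative.

    # Hedged sketch: balance classes by random down-sampling to the
    # minority-class size.
    import random
    from collections import defaultdict

    def downsample_balanced(labels, seed=42):
        """Return sample indices with an equal count per emotion label."""
        by_label = defaultdict(list)
        for idx, lab in enumerate(labels):
            by_label[lab].append(idx)
        n_min = min(len(idxs) for idxs in by_label.values())  # minority size
        rng = random.Random(seed)                             # reproducible
        kept = []
        for idxs in by_label.values():
            kept.extend(rng.sample(idxs, n_min))              # per-class draw
        return sorted(kept)

    # Example: three 'anger' vs one 'sadness' sample -> one index per class.
    print(downsample_balanced(["anger", "anger", "anger", "sadness"]))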

References

  • Amir, N., Ron, S., & Laor, N. (2000). Analysis of an emotional speech corpus in Hebrew based on objective criteria. In Proceedings of the ITRW, ISCA, Newcastle, UK (pp. 29–33).

  • Aubergé, V., Audibert, N., & Rilliard, A. (2003). Why and how to control the authentic emotional speech corpora. In Proceedings of the Interspeech, ISCA, Geneva, Switzerland (pp. 185–188).

  • Baiocco, R., Giannini, A. M., & Laghi, F. (2005). SAR–Scala Alessitimica Romana. Valutazione delle capacità di riconoscere, esprimere e verbalizzare le emozioni. Trento: Erickson.

  • Bänziger, T., Pirker, H., & Scherer, K. (2006). GEMEP-GEneva multimodal emotion portrayals: A corpus for the study of multimodal emotional expressions. In Proceedings of LREC, ELRA, Genova, Italy (pp. 15–19).

  • Barkhuysen, P., Krahmer, E., & Swerts, M. (2010). Crossmodal and incremental perception of audiovisual cues to emotional speech. Language and Speech, 53(1), 3–30.

  • Batliner, A., Fischer, K., Huber, R., Spilker, J., & Nöth, E. (2000). Desperately seeking emotions or: Actors, wizards, and human beings. In Proceedings of ITRW, ISCA, Newcastle, UK (pp. 195–200).

  • Batliner, A., Hacker, C., Steidl, S., Nöth, E., D’Arcy, S., Russell, M. J., & Wong, M. (2004). ‘You stupid tin box’—children interacting with the AIBO Robot: A cross-linguistic emotional speech corpus. In Proceedings of LREC, ELRA, Lisbon, Portugal (pp. 171–174).

  • Batliner, A., Steidl, S., Hacker, C., Nöth, E., & Niemann, H. (2005). Tales of tuning—prototyping for automatic classification of emotional user states. In Proceedings of Interspeech, ISCA, Lisbon, Portugal (pp. 489–492).

  • Bennett, M. J. (1979). Overcoming the golden rule: Sympathy and empathy. Annals of the International Communication Association, 3(1), 407–422.

  • Bonny, H. L. (2002). Music and consciousness: The evolution of guided imagery and music. Gilsum, NH: Barcelona Publishers.

  • Bradley, M. M., & Lang, P. J. (2000). Affective reactions to acoustic stimuli. Psychophysiology, 37(2), 204–215.

  • Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of Interspeech, ISCA, Lisbon, Portugal (pp. 1517–1520).

  • Cavanagh, S. R., Urry, H. L., & Shin, L. M. (2011). Mood-induced shifts in attentional bias to emotional information predict ill-and well-being. Emotion, 11(2), 241–248.

  • Chiţu, A. G., van Vulpen, M., Takapoui, P., & Rothkrantz, L. J. M. (2008). Building a Dutch multimodal corpus for emotion recognition. In Proceedings of the LREC Workshop on Corpora for Research on Emotion and Affect, Marrakesh, Morocco (pp. 53–56).

  • Ciceri, M. R., & Anolli, L. M. (2000). La voce delle emozioni: Verso una semiosi della comunicazione vocale non-verbale delle emozioni. Milan: Franco Angeli.

  • Costantini, G., Iaderola, I., Paoloni, A., & Todisco, M. (2014). EMOVO Corpus: An Italian emotional speech database. In Proceedings of LREC, ELRA, Reykjavik, Iceland (pp. 3501–3504).

  • Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., & Schröder, M. (2000). Feeltrace: An instrument for recording perceived emotion in real time. In Proceedings of ITRW, ISCA, Newcastle, UK (pp. 19–24).

  • Cullen, C., Vaughan, B., Kousidis, S., Wang, Y., McDonnell, C., & Campbell, D. (2006). Generation of high quality audio natural emotional speech corpus using task based mood induction. In Proceedings of InSciT, Dublin Institute of Technology, Mérida, Spain.

  • Dan-Glauser, E. S., & Scherer, K. R. (2011). The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behavior Research Methods, 43(2), 468–477.

  • Devillers, L., Abrilian, S., & Martin, J.-C. (2005a). Representing real-life emotions in audiovisual data with non basic emotional patterns and context features. In Proceedings of ACII (pp. 519–526).

  • Devillers, L., Vidrascu, L., & Lamel, L. (2005b). Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18, 407–422.

  • Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Towards a new generation of databases. Speech Communication, 40(1), 33–60.

  • Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of ITRW, ISCA, Newcastle, UK (pp. 39–44).

  • Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, O., McRorie, M., Martin, J.-C., Devillers, L., Abrilian, S., Batliner, A., Amir, N., & Karpouzis, K. (2007). The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data. In Proceedings of ACII, AAAC, Lisbon, Portugal (pp. 488–500).

  • Douglas-Cowie, E., Cox, C., Martin, J.-C., Devillers, L., Cowie, R., Sneddon, I., et al. (2011). Data and databases. In P. Petta, C. Pelachaud, & R. Cowie (Eds.), Emotion-oriented systems: The HUMAINE handbook (pp. 163–284). Berlin: Springer.

  • Ekman, P. (1984). Expression and the nature of emotion. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 319–344). Hillsdale, NJ: Erlbaum.

  • El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.

  • Eyben, F., Salomão, G. L., Sundberg, J., Scherer, K. R., & Schuller, B. W. (2015). Emotion in the singing voice—a deeper look at acoustic features in the light of automatic classification. EURASIP Journal on Audio, Speech, and Music Processing, 1, 1–9.

  • Eyben, F., Wöllmer, M., & Schuller, B. (2010). Opensmile: the Munich versatile and fast open-source audio feature extractor. In Proceedings of ACM Multimedia, ACM, Florence, Italy (pp. 1459–1462).

  • Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

  • Fernandez, R., & Picard, R. W. (2003). Modeling drivers’ speech under stress. Speech Communication, 40, 145–159.

  • Fischer, A. H. (1993). Sex differences in emotionality: Fact or stereotype? Feminism and Psychology, 3, 303–318.

  • Fontaine, J. R., Scherer, K. R., Roesch, E. B., & Ellsworth, P. C. (2007). The world of emotions is not two-dimensional. Psychological Science, 18(12), 1050–1057.

  • Gerrards-Hesse, A., Spies, K., & Hesse, F. W. (1994). Experimental inductions of emotional states and their effectiveness: A review. British Journal of Psychology, 85(1), 55–78.

  • Grichkovtsova, I., Morel, M., & Lacheret, A. (2012). The role of voice quality and prosodic contour in affective speech perception. Speech Communication, 54(3), 414–429.

  • Gross, J., & Levenson, R. (1995). Emotion elicitation using films. Cognition and Emotion, 9, 87–108.

  • Husain, G., Thompson, W. F., & Schellenberg, E. G. (2002). Effects of musical tempo and mode on arousal, mood, and spatial abilities. Music Perception: An Interdisciplinary Journal, 20(2), 151–171.

  • Iida, A., Campbell, N., Higuchi, F., & Yasumura, M. (2003). A corpus-based speech synthesis system with emotion. Speech Communication, 40(1–2), 161–187.

  • Johnstone, T., & Scherer, K. R. (1999). The effects of emotions on voice quality. In Proceedings of ICPhS, UCLA, San Francisco, CA (pp. 2029–2032).

  • Johnstone, T., van Reekum, C. M., Hird, K., Kirsner, K., & Scherer, K. R. (2005). Affective speech elicited with a computer game. Emotion, 5(4), 513.

  • Keltner, D. (1996). Evidence for the distinctness of embarrassment, shame, and guilt: A study of recalled antecedents and facial expressions of emotion. Cognition and Emotion, 10, 155–172.

  • Klasmeyer, G., Johnstone, T., Bänziger, T., Sappok, C., & Scherer, K. R. (2000). Emotional voice variability in speaker verification. In Proceedings of ITRW, ISCA, Newcastle, UK (pp. 213–218).

  • Konečni, V. J., Brown, A., & Wanic, R. A. (2008). Comparative effects of music and recalled life-events on emotional state. Psychology of Music, 36(3), 289–308.

  • Labov, W. (1972). Sociolinguistic patterns. Philadelphia, PA: University of Pennsylvania Press.

  • Martin, M. (1990). On the induction of mood. Clinical Psychology Review, 10(6), 669–697.

  • Mayer, J. D., Allen, J. P., & Beauregard, K. (1995). Mood inductions for four specific moods: A procedure employing guided imagery vignettes with music. Journal of Mental Imagery, 19(1–2), 151–159.

  • McCraty, R., Barrios-Choplin, B., Atkinson, M., & Tomasino, D. (1998). The effects of different types of music on mood, tension, and mental clarity. Alternative Therapies in Health and Medicine, 4(1), 75–84.

  • Mencattini, A., Martinelli, E., Costantini, G., Todisco, M., Basile, B., Bozzali, M., et al. (2014). Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure. Knowledge-Based Systems, 63, 68–81.

  • Mikula, G., Scherer, K. R., & Athenstaedt, U. (1998). The role of injustice in the elicitation of differential emotional reactions. Personality and Social Psychology Bulletin, 24(7), 769–783.

  • Mower, E., Metallinou, A., Lee, C., Kazemzadeh, A., Busso, C., Lee, S., & Narayanan, S. (2009). Interpreting ambiguous emotional expressions. In Proceedings of ACII, IEEE, Amsterdam, The Netherlands.

  • Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16, 369–390.

  • Ortony, A., & Turner, T. J. (1990). What’s basic about basic emotions? Psychological Review, 97(3), 315–331.

  • Parada-Cabaleiro, E., Baird, A., Batliner, A., Cummins, N., Hantke, S., & Schuller, B. (2017). The perception of emotions in noisified non-sense speech. In Proceedings of Interspeech, ISCA, Stockholm, Sweden (pp. 3246–3250).

  • Parada-Cabaleiro, E., Costantini, G., Batliner, A., Baird, A., & Schuller, B. (2018). Categorical vs Dimensional perception of Italian emotional speech. In Proceedings of Interspeech, ISCA, Hyderabad, India (pp. 3638–3642).

  • Philippot, P. (1993). Inducing and assessing differentiated emotion-feeling states in the laboratory. Cognition and Emotion, 7(2), 171–193.

  • Plutchik, R. (1991). The emotions. Lanham, MD: University Press of America.

  • Roedema, T. M., & Simons, R. F. (1999). Emotion-processing deficit in alexithymia. Psychophysiology, 36(3), 379–387.

  • Rosch, E. H. (1973). Natural categories. Cognitive Psychology, 4(3), 328–350.

  • Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

  • Russell, J. A. (1991). In defense of a prototype approach to emotion concepts. Journal of Personality and Social Psychology, 60, 37–47.

  • Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695–729.

  • Scherer, K. R. (2013). Vocal markers of emotion: Comparing induction and acting elicitation. Computer Speech and Language, 27(1), 40–58.

  • Scherer, K. R., & Ceschi, G. (1997). Lost luggage: A field study of emotion-antecedent appraisal. Motivation and Emotion, 21(3), 211–235.

  • Scherer, K. R., Shuman, V., Fontaine, J. R., & Soriano, C. (2013). The grid meets the wheel: Assessing emotional feeling via self-report. In J. R. Fontaine, K. R. Scherer, & C. Soriano (Eds.), Components of emotional meaning: A sourcebook (pp. 281–298). Oxford: Oxford University Press.

  • Schienle, A., Schäfer, A., Stark, R., Walter, B., & Vaitl, D. (2005). Relationship between disgust sensitivity, trait anxiety and brain activity during disgust induction. Neuropsychobiology, 51, 86–92.

  • Schlosberg, H. (1954). Three dimensions of emotion. Psychological Review, 61(2), 81.

  • Schröder, M. (2004). Speech and emotion research: An overview of research frameworks and a dimensional approach to emotional speech synthesis. PhD thesis, Saarland University.

  • Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9–10), 1062–1087.

  • Schuller, B., Steidl, S., Batliner, A., Marschik, P. B., Baumeister, H., Dong, F., Hantke, S., Pokorny, F., Rathner, E.-M., Bartl-Pokorny, K. D., Einspieler, C., Zhang, D., Baird, A., Amiriparian, S., Qian, K., Ren, Z., Schmitt, M., Tzirakis, P., & Zafeiriou, S. (2018). The Interspeech 2018 computational paralinguistics challenge: Atypical and self-assessed affect, crying and heart beats. In Proceedings of Interspeech, ISCA, Hyderabad, India (pp. 122–126).

  • Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., Chetouani, M., Weninger, F., Eyben, F., Marchi, E., Mortillaro, M., Salamin, H., Polychroniou, A., Valente, F., & Kim, S. (2013). The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of Interspeech, ISCA, Lyon, France (pp. 148–152).

  • Schutte, N. S., Malouff, J. M., Hall, L. E., Haggerty, D. J., Cooper, J. T., Golden, C. J., et al. (1998). Development and validation of a measure of emotional intelligence. Personality and Individual Differences, 25(2), 167–177.

  • Singhi, A., & Brown, D. G. (2014). On cultural, textual and experiential aspects of music mood. In Proceedings of ISMIR, ISMIR, Taipei, Taiwan (pp. 3–8).

  • Sobin, C., & Alpert, M. (1999). Emotion in speech: The acoustic attributes of fear, anger, sadness, and joy. Journal of Psycholinguistic Research, 28(4), 347–365.

  • Tato, R., Santos, R., Kompe, R., & Pardo, J. M. (2002). Emotional space improves emotion recognition. In Proceedings of ICSLP, ISCA, Denver, CO (pp. 2029–2032).

  • Tolkmitt, F. J., & Scherer, K. R. (1986). Effect of experimentally induced stress on vocal parameters. Journal of Experimental Psychology: Human Perception and Performance, 12(3), 302–313.

  • Truong, K. P., Van Leeuwen, D. A., & de Jong, F. M. G. (2012). Speech-based recognition of self-reported and observed emotion in a dimensional space. Speech Communication, 54(9), 1049–1063.

  • Türk, U. (2001). The technical processing in SmartKom data collection: A case study. In Proceedings of Eurospeech, ISCA, Aalborg, Denmark (pp. 1541–1544).

  • Utay, J., & Miller, M. (2006). Guided imagery as an effective therapeutic technique: A brief review of its history and efficacy research. Journal of Instructional Psychology, 33, 40–44.

  • Van der Does, W. (2002). Different types of experimentally induced sad mood? Behavior Therapy, 33(4), 551–561.

  • Västfjäll, D. (2001). Emotion induction through music: A review of the musical mood induction procedure. Musicae Scientiae, 5(1), 173–211.

  • Vaughan, B. (2011). Naturalistic emotional speech corpora with large scale emotional dimension ratings. PhD thesis, Dublin Institute of Technology.

  • Velten, E. (1968). A laboratory task for induction of mood states. Behaviour Research and Therapy, 6(4), 473–482.

  • Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162–1181.

  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.

  • Westermann, R., Stahl, G., Spies, K., & Hesse, F. W. (1996). Relative effectiveness and validity of mood induction procedures: A meta-analysis. European Journal of Social Psychology, 26, 557–580.

  • Williams, C. E., & Stevens, K. N. (1972). Emotions and speech: Some acoustic correlates. The Journal of the Acoustical Society of America, 52(4B), 1238–1250.

  • Zhang, T., Hasegawa-Johnson, M., & Levinson, S. (2004). Children’s emotion recognition in an intelligent tutoring scenario. In Proceedings of Interspeech, ISCA, Jeju Island, Korea (pp. 1441–1444).

  • Zou, C., Huang, C., Han, D., & Zhao, L. (2011). Detecting practical speech emotion in a cognitive task. In Proceedings of ICCCN, IEEE, Maui, HI (pp. 1–5).

Acknowledgements

This work was supported by the European Union’s 7th Framework Programme under Grant Agreement No. 338164 (ERC StG iHEARu). We would also like to thank Arianna Mencattini for her help with the texts used in the induction procedure, as well as Barbara Basile for her advice on the psychological measurements.

Author information


Correspondence to Emilia Parada-Cabaleiro.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Parada-Cabaleiro, E., Costantini, G., Batliner, A. et al. DEMoS: an Italian emotional speech corpus. Lang Resources & Evaluation 54, 341–383 (2020). https://doi.org/10.1007/s10579-019-09450-y

