Introduction

Emotions help people communicate and understand others’ opinions by conveying feelings and providing feedback [46]. Human speech provides a natural and intuitive interface for communicating with robots and is thus widely integrated into robots that interact with humans. Speech emotion recognition attempts to infer the intended emotions from voice signals, independently of the semantic content of the speech [19]. To enable robots to perceive a user’s emotions accurately, a speech emotion recognition system can be integrated with simple speech recognition; however, the system should identify emotions for each individual independently of cultural and linguistic diversity.

Cross-corpus emotion recognition attempts to build classifiers that generalize across application scenarios and acoustic conditions, and is highly relevant for constructing effective and practical speech emotion recognition systems [38]. Research has shown cross-corpus emotion recognition to be challenging for several reasons, including differences in signal conditions, the type of emotion elicitation, and data scarcity. Many researchers have tried to tackle these problems by creating their own emotional corpora [20, 27], trying out different feature sets [46], or using multiple machine learning models, but there is still considerable room for improvement. Ensemble learning is known to improve the performance of machine learning models [17, 29, 33]. This motivates further exploration of techniques that can improve cross-corpus speech emotion recognition and enable the deployment of speech emotion recognition systems in real-life applications.

Human speech is so diverse and dynamic that no single model can be expected to serve indefinitely [42]. This diversity of languages causes an imbalance in the datasets available for emotion recognition between minority languages such as Urdu or Sindhi and well-established majority languages such as English. There is a need for a model that generalizes to multi-lingual emotional data using the datasets currently available. Researchers therefore need to examine how minority languages perform on models trained on majority languages.

Different machine learning algorithms [32] have been used to classify emotions accurately within the same corpus, but when applied across corpora, their performance has been mediocre. This highlights the fact that machine learning algorithms can detect emotions within the same corpus, but for cross-corpus settings, researchers need to find a way to carry this within-corpus detection ability over to cross-corpus data.

Existing studies [1, 37, 38] have either extracted an enormous number of features, which leads to long computing times, or have used a single machine learning algorithm [11, 20] to classify emotions into their respective categories, thereby forgoing the complementary information that multiple classifiers offer and relying instead on a single classifier, which has proved to yield lower accuracy than desired.

In this paper, the researchers propose a speech emotion recognition system for robots that uses a combination of audio features and an ensemble learning approach to detect emotions accurately, both within a corpus and across corpora. For this, the researchers use corpora in four different languages (Urdu, English, German, and Italian) and conduct experiments with Urdu as the base language in various scenarios against the other three languages. The researchers investigate the effect of combining the classifiers most popularly used for speech emotion recognition through a majority voting approach and demonstrate how it enhances cross-lingual emotion recognition.

In this paper, the researchers make the following contributions:

  • Propose an effective ensemble learning approach to detect emotions across corpora.

  • Evaluate the effectiveness of the ensemble technique.

  • Present a comparative analysis of conventional machine learning techniques, namely decision tree (J48), random forest (RF), and sequential minimal optimization (SMO), against an ensemble of these algorithms based on majority voting.

  • Demonstrate that the ensemble learning approach effectively enhances emotion detection and achieves good accuracy on both within-corpus and cross-corpus data in comparison with conventional machine learning techniques.

Fig. 1 Graphical representation of the proposed ensemble learning approach for multi-lingual speech emotion recognition

The rest of the paper is organized as follows. “Related work” briefly covers the technical background and recent research on cross-corpus speech emotion recognition. “Proposed approach” presents an overview of the proposed ensemble learning approach for cross-corpus speech emotion recognition. The experimental setup and results are presented in “Evaluation and results”. “Comparative analysis” compares the results with prior work, and “Conclusion” concludes the paper with directions for future work.

Related work

Over the past two decades, there has been significant research on speaker-independent speech emotion recognition. This research has highlighted multiple factors that influence accurate detection of emotion, for example, the dataset used, the features extracted, and the classifier used to predict emotions. Sailunaz et al. [36] presented a detailed survey of the available datasets, the features extracted, and the models most used by researchers. However, there is limited research on multi-lingual cross-corpus speech emotion recognition. Initial studies exist on improving the robustness of multi-lingual speech emotion recognition by combining several emotional speech corpora in the training set, thereby reducing the paucity of data [22].

The authors in [8] performed pilot experiments using support vector machines on four datasets in two different languages (German and English) to show the practicality of cross-corpus emotion recognition. The authors in [37] performed experiments using support vector machines on six datasets in three different languages (German, English, and Danish) and revealed the drawbacks of existing analyses and corpora. The authors in [1] developed an ensemble SVM for speech emotion recognition focused on emotion recognition in previously unseen languages.

The authors in [35] identified a speaker’s language to some extent and chose an appropriate model based on that knowledge. The authors in [44] chose an unsupervised learning approach to identify emotion in unlabeled data and found that unlabeled training data give approximately half of the gain that can be extracted from adding labeled training data. In [23], the authors used a three-layer model on corpora from three languages (German, Chinese, and Japanese) and found it accurate, yielding small errors. Li and Akagi [24] focused on choosing generalizable features from the prosodic, spectral, and glottal waveform domains for multi-lingual speech emotion recognition. In [6], the authors used sparse autoencoders for feature transfer learning in speech emotion recognition. Using six standard databases, they trained a single-layer sparse autoencoder on class-specific instances from the target domain and then applied this representation to the source domain to reconstruct those data. This approach improves the model’s performance compared to learning independently from every source domain. In [21], the authors used deep belief networks (DBN) for emotion recognition and found that networks with generalization power, such as deep belief networks, are better than traditional discriminative networks such as sparse autoencoders, although this needs further investigation.

In [26], the authors performed emotion recognition on two languages (English and French) and examined the performance of one model trained on multiple languages. Elbarougy et al. [7] examined the distinctions and commonalities of emotions in valence-activation space between three languages (Japanese, Chinese, and German) using 30 speakers and showed that emotions are expressed very similarly by speakers of different languages. In [27], the authors created a new emotional database named EmoSTAR in two languages (Turkish and English) and conducted cross-corpus tests with a German dataset using SVM. In [43], the authors performed experiments on three emotion corpora (Danish, Mandarin Chinese, and German) and achieved results that indicate universal cues in emotion expression regardless of language.

In [20], the authors created a new emotional database in the Urdu language, performed experiments on three additional language corpora (German, English, and Italian) using an SVM classifier, evaluated the results of training and testing a model on different languages, and found that adding some testing-language data to the training data can improve performance. The authors in [45] used 1D and 2D CNN-LSTM networks to identify speech emotions. The authors in [40] analyzed the effect of noise-removal techniques on speech emotion recognition systems. The authors in [11] performed transfer learning and multi-task learning experiments and found that traditional machine learning models may perform as well as deep learning models [2, 41] for speech emotion recognition, provided the right input features are chosen.

Table 1 Corpora information

Proposed approach

Many factors influence the accurate detection of emotion in a cross-corpus setting: the dataset used, the features extracted from the audio signals, and the classifiers used to detect emotion can all significantly influence the results. Figure 1 summarizes the proposed approach for multi-lingual speech emotion recognition. This study works with four corpora (SAVEE, URDU, EMO-DB, and EMOVO) that provide a diversity of languages (English, Urdu, German, and Italian) for testing multi-lingual speech emotion recognition. To ensure the same class labels for every dataset, this study uses the binary valence (positive and negative) approach, as presented in Table 1; an illustrative sketch of this mapping is shown below. The proposed approach extracts a combination of spectral and prosodic features from the raw audio files to feed into the classifiers. An ensemble learning approach with majority voting is used to train the model to classify emotions into their respective categories accurately. Further details on the selected databases, the extracted speech features, and the ensemble classifiers are presented below.
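
The binary valence mapping can be implemented as a simple lookup over the corpus-specific emotion labels. The grouping below is an illustrative assumption (the authoritative mapping is the one given in Table 1), and the names VALENCE_MAP and to_valence are hypothetical:

```python
# Hypothetical valence lookup; the exact label grouping follows Table 1 of the
# paper, so the assignment below is an illustrative assumption only.
VALENCE_MAP = {
    # negative valence
    "anger": "negative", "sadness": "negative", "fear": "negative",
    "disgust": "negative", "boredom": "negative",
    # positive valence
    "happiness": "positive", "joy": "positive", "neutral": "positive",
    "surprise": "positive",
}

def to_valence(emotion_label: str) -> str:
    """Collapse a corpus-specific emotion label to binary valence."""
    return VALENCE_MAP[emotion_label.lower()]
```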

Speech emotion databases

For multi-lingual speech emotion recognition, the data should be diverse. For this study, four datasets, each in a different language, are selected based on their recording environments, the categories of emotion classes available, and the balance between positive- and negative-valence classes.

SAVEE

The Surrey Audio-Visual Expressed Emotion (SAVEE) database [13] was recorded from four male English speakers. Emotion is categorized into seven discrete categories: anger, disgust, happiness, sadness, fear, neutral, and surprise. There are 120 utterances for each speaker. The audio was recorded in a controlled environment and acted out by the speakers. The corpus is publicly available for research.

Urdu

The Urdu database [20] contains audio recordings collected from Urdu TV talk shows, consisting of 400 recordings from 38 speakers (27 male, 11 female). The data cover four basic emotions: anger, happiness, sadness, and neutral. This corpus contains natural emotional excerpts from real, unscripted discussions between different guests of TV talk shows. The dataset is publicly available for research.

EMO-DB

The Berlin database of emotional speech [3] is a German database containing speech recordings from 10 actors (5 male, 5 female). The data consist of 10 German sentences recorded in anger, boredom, disgust, fear, happiness, sadness, and neutral. The database has 497 annotated utterances, recorded in a studio with trained actors to elicit appropriate emotional responses. This corpus is available for research purposes.

EMOVO

EMOVO is an Italian speech emotion database [5] that consists of recordings from 6 actors (3 male, 3 female) simulating 7 emotional states: disgust, fear, anger, joy, surprise, sadness, and neutral. Each emotion is uttered in 14 sentences, giving a total of 588 annotated audio recordings. The recordings were made in a studio by trained actors; EMOVO is the first emotional database for the Italian language and is available online.

Feature extraction

The authors in [11] deduced that choosing the right input features can be key to efficient emotion recognition [30]. This work experimented with different types of features, both spectral and prosodic, for each dataset. Mel-frequency cepstral coefficients (MFCCs) are among the most widely used features for speech and emotion recognition. To generate MFCCs, the researchers use the Librosa [25] Python library, considering the first 20 MFCCs for experimentation. Besides MFCCs, spectral (roll-off, flux, centroid, bandwidth), energy (root-mean-square energy), raw-signal (zero-crossing rate), pitch (fundamental frequency), and chroma features are also used. Each feature is calculated every 0.02 s of the audio files. Then, following the most common statistical approach, the researchers take the median of the values calculated across all frames to constitute the value for the corresponding feature. Table 2 describes the features extracted from each feature group. A total of 28 features are extracted for each audio file, and the results are stored in a CSV file.
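
A minimal sketch of this extraction pipeline is shown below, assuming standard Librosa feature functions. Librosa has no dedicated spectral-flux feature, so onset strength is used here as a flux-like proxy, and parameters such as the F0 search range are assumptions rather than the authors’ settings:

```python
import numpy as np
import librosa

def extract_features(path: str) -> np.ndarray:
    """Median-pooled frame-level features: 20 MFCCs + 8 other tracks = 28 values."""
    y, sr = librosa.load(path, sr=None)
    hop = int(0.02 * sr)  # one frame every 0.02 s, as stated in the text

    tracks = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop),   # 20 MFCCs
        librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop),  # spectral
        librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop),
        librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop),
        # Flux-like proxy (assumption): Librosa exposes no spectral_flux feature.
        librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)[np.newaxis, :],
        librosa.feature.rms(y=y, hop_length=hop),                      # energy
        librosa.feature.zero_crossing_rate(y, hop_length=hop),         # raw signal
        # Fundamental frequency via YIN; the 50-500 Hz search range is assumed.
        librosa.yin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)[np.newaxis, :],
        # Chroma collapsed to a single track so the total stays at 28 (assumption).
        np.median(librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop),
                  axis=0, keepdims=True),
    ]
    # Median over frames for each track -> one 28-dimensional feature vector.
    return np.concatenate([np.median(t, axis=1) for t in tracks])
```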

To benchmark the performance of the selected features as inputs, this work also uses a different feature set, eGeMAPS, which consists of 88 features related to energy, spectral, frequency, cepstral, and dynamic information. Details on these features can be found in [10]. To extract eGeMAPS features, the researchers use the openSMILE toolkit [9] and save the results in a CSV file.
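
A minimal sketch of the eGeMAPS extraction is shown below. The paper refers to the openSMILE toolkit itself; using its Python wrapper and the eGeMAPSv02 functional set here is an assumption:

```python
import opensmile

# 88 eGeMAPS functionals per file (eGeMAPSv02 is the wrapper's current revision).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("speech.wav")  # pandas DataFrame, one row per file
features.to_csv("egemaps_features.csv")
```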

Table 2 Features extracted
Fig. 2 Results achieved using Urdu as training set, Urdu as testing set, and within-corpus experiments

Preprocessing

An imbalanced dataset causes machine learning algorithms to underperform [14, 18, 28, 31]. The synthetic minority oversampling technique (SMOTE) [4, 15, 16] is a powerful approach to tackle the class imbalance problem, and it is used after feature extraction [34] to balance the instances in each class. The extracted features also span a wide range of values that need to be converted to a common scale for the classifiers to perform well, so data normalization is performed to scale the feature values between 0 and 1 [12].
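
A sketch of this preprocessing step, assuming the imbalanced-learn and scikit-learn libraries and the standard practice of oversampling and fitting the scaler on the training split only:

```python
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler

def balance_and_scale(X_train, y_train, X_test):
    """SMOTE-balance the training data, then min-max scale all features to [0, 1]."""
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
    scaler = MinMaxScaler()                 # maps each feature to the [0, 1] range
    X_bal = scaler.fit_transform(X_bal)     # fit on (balanced) training data only
    X_test = scaler.transform(X_test)       # reuse the training-set scaling
    return X_bal, y_bal, X_test
```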

Fig. 3 Within-corpus results

Classification models and parameter setting

For experimentation, this approach uses support vector machines (SVM), which provide good classification results even on small datasets. SVM is known to perform well on high-dimensional data, which is often the case when working with audio, and it has been widely used for speech emotion recognition. The proposed approach uses SVM with the PUK kernel, a complexity of 1.0, and pairwise multi-class discrimination based on sequential minimal optimization (SMO). Furthermore, this study uses random forest (RF), another benchmark classifier widely used for classification problems, with 10 trees. A decision tree (J48) is also used to classify the data into their respective categories, with a confidence factor of 0.25 for pruning and a minimum of 2 instances per leaf. Finally, this study uses an ensemble learning approach through majority voting, combining the SMO, RF, and J48 classifiers for cross-corpus emotion recognition.
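
The SMO, J48, and PUK names follow WEKA terminology. The sketch below approximates the same setup in scikit-learn, where SVC stands in for SMO (the PUK kernel has no scikit-learn equivalent, so an RBF kernel is assumed) and DecisionTreeClassifier stands in for J48:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

svm = SVC(C=1.0, kernel="rbf")                   # complexity 1.0; RBF replaces PUK
rf = RandomForestClassifier(n_estimators=10)     # 10 trees, as in the paper
dt = DecisionTreeClassifier(min_samples_leaf=2)  # J48's minimum instances per leaf

# Hard voting = majority voting over the three base classifiers.
ensemble = VotingClassifier(
    estimators=[("smo", svm), ("rf", rf), ("j48", dt)],
    voting="hard",
)
```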

Evaluation and results

This study conducts multiple experiments with Urdu as the base language, tested against the remaining three languages (English, German, and Italian). The researchers use the leave-one-speaker-out scheme to split the data into training and testing sets, and use accuracy, precision, recall, and F-score to evaluate the proposed ensemble model’s performance.
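
A sketch of this evaluation loop, assuming scikit-learn’s LeaveOneGroupOut with per-utterance speaker IDs as the groups; weighted averaging for precision, recall, and F-score is an assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(model, X, y, speakers):
    """Leave-one-speaker-out: each fold holds out every utterance of one speaker.
    X, y, and speakers are NumPy arrays of equal length."""
    y_true, y_pred = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        model.fit(X[train_idx], y[train_idx])      # refit on remaining speakers
        y_pred.extend(model.predict(X[test_idx]))
        y_true.extend(y[test_idx])
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    return accuracy_score(y_true, y_pred), prec, rec, f1
```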

Figure 2 gives an overview of the results achieved. The experiments use multiple machine learning algorithms and an ensemble learning approach, as described below.

Within-corpus experiments

This work conducts within-corpus experiments to establish a baseline for the features and the selected classifiers on each corpus. For this experiment, the researchers use training and testing data from the same corpus, which helps to understand how well the models can perform on a given corpus. As depicted in Fig. 3, the Urdu corpus gives impressive results: SMO achieves an accuracy of 98.5%, followed by the ensemble with an accuracy of 96.75%. For the EMO-DB (German) corpus, SMO gives an accuracy of 90.4%, followed closely by the ensemble learning approach at 89.75%. For the SAVEE (English) corpus, RF gives the highest accuracy of 70.14%, while ensemble learning gives 69.31%. Finally, for the EMOVO (Italian) database, SMO gives an accuracy of 89.41%, followed by the ensemble learning approach at 87.14%. From this experiment, the researchers observe that no matter which algorithm gives the highest accuracy, ensemble learning ranks second, and not by a large margin. SMO may perform better on one corpus, while RF may be best on another; no single classifier can be generalized as the best for cross-corpus data. The ensemble learning approach, on the other hand, gives comparable results that can be used for cross-corpus speech emotion recognition without compromising on a lower accuracy rate for some language.

Cross-corpus experiments

For this set of experiments, the experimental pattern of [20] is followed. This work first uses the Urdu data for training the model and tests it against the three western languages (English, German, and Italian). Experiments are performed using the three machine learning algorithms (SMO, RF, and J48) and the ensemble learning approach; a sketch of the protocol follows this paragraph. Tables 3, 4, and 5 depict the performance of the classifiers against each corpus. Interestingly, a different classifier performed best for each corpus. When testing with data from the EMO-DB (German) corpus, SMO with the PUK kernel performs best, giving an accuracy of 63%, while the other classifiers give lower accuracy. When testing with data from the EMOVO (Italian) corpus, random forest (RF) performs best with an accuracy of 60.02%, while the other classifiers give lower accuracy. Finally, when testing on the SAVEE (English) corpus, J48 gives an accuracy of 48.34%, again higher than the other classifiers. This observation raises a question: which classifier should be used to implement a multi-lingual speech emotion recognition system? The ensemble learning approach may not give the best accuracy, but it shows promising results when trained on the Urdu data and tested against the other three corpora. It answers the question of which classifier to use by combining the effect of all three classifiers through majority voting, which ensures consistent accuracy for a cross-corpus model.
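
A sketch of this cross-corpus protocol, reusing the balance_and_scale helper and the ensemble classifier from the earlier sketches; load_corpus is a hypothetical helper returning feature matrices and binary-valence labels for a named corpus:

```python
# `load_corpus` is hypothetical: it should return (features, labels) arrays.
X_train, y_train = load_corpus("URDU")

for target in ("EMO-DB", "EMOVO", "SAVEE"):
    X_test, y_test = load_corpus(target)
    # Balance and scale using the training corpus only (see preprocessing sketch).
    X_tr, y_tr, X_te = balance_and_scale(X_train, y_train, X_test)
    ensemble.fit(X_tr, y_tr)  # majority-voting ensemble from the classifier sketch
    print(f"URDU -> {target}: accuracy = {ensemble.score(X_te, y_test):.3f}")
```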

Fig. 4 Performance comparison of the proposed approach with the referred paper

For the next set of experiments, the proposed approach uses the EMO-DB corpus for training and the Urdu data for testing. Evaluating all classifiers gives an accuracy of 60% for the J48 classifier, while the other classifiers give moderate accuracy, as shown in Table 6. This study then uses the EMOVO (Italian) corpus for training the models and tests them against the Urdu data; in this case, the ensemble gives the highest accuracy of 62.5%, while the individual classifiers give lower accuracy, as shown in Table 7. Finally, this study trains the models on the SAVEE (English) corpus and tests them on the Urdu data; the SMO classifier gives the highest accuracy of 50%, while the other classifiers perform worse, as shown in Table 8.

This set of experiments also supports the observation that no single classifier performs best in every scenario.

Table 3 Training on Urdu corpus, testing on Italian corpus
Table 4 Training on Urdu corpus, testing on German corpus
Table 5 Training on Urdu corpus, testing on English corpus
Table 6 Training on German corpus, testing on Urdu corpus
Table 7 Training on Italian corpus, testing on Urdu corpus
Table 8 Training on English corpus, testing on Urdu corpus
Fig. 5 Performance comparison of the proposed approach with the referred paper, setting Urdu data as training data and testing on data from other languages

Fig. 6 Performance comparison of the proposed approach with the referred paper, setting Urdu data as testing data and training on data from other languages

Comparative analysis

To analyze the efficacy of the proposed approach, this study compares its results with a notable study [20], whose experimental pattern was followed here. Those authors extracted eGeMAPS [10] features from their raw audio data and used SVM with a Gaussian kernel to classify the data into their respective categories. Figure 4 compares the accuracy of the proposed ensemble learning approach with that of the referred paper. For the Urdu database, the ensemble learning approach increases accuracy by 13%. For EMO-DB, the accuracy increases by 8% using ensemble learning. For the EMOVO (Italian) corpus, ensemble learning improves the accuracy by 11%. Finally, for the SAVEE (English) corpus, an increase in accuracy of almost 5% is achieved using the ensemble learning approach.

Figures 5 and 6 present an overview of the cross-corpus comparison. When training on the Urdu corpus, testing on EMO-DB (German) and EMOVO (Italian) yields accuracy increases of 2% and 15%, respectively, while for the SAVEE corpus, this study observes a decline of 6% using the ensemble learning approach. When testing on the Urdu corpus, this work achieves accuracy increases of 7%, 3%, and 5% for the German, Italian, and English corpora, respectively.

Conclusion

The paradigm shift from textual to more intuitive control mechanisms such as speech in human–robot interaction (HRI) has opened several research areas, including speech emotion recognition. Much past research on speech emotion recognition has focused on using data from the same corpus for both training and testing. This study proposed an ensemble learning technique based on majority voting to handle emotions in multiple languages and enable robots to operate globally. It is observed that different classifiers work differently for different languages, which raises the question of which classifier works best across all languages. The ensemble learning approach, which combines three of the most popular machine learning algorithms through a majority voting scheme, gives comparable results for all languages. This finding can be very helpful for developing an emotion recognition system for robots designed to handle customers from all corners of the globe [39]. It will enable robots to interact with customers with emotional intelligence, which can have a huge impact on the way the world interacts with robots. The researchers plan to explore more machine learning algorithms for use in an ensemble in the future. To enable the application of this research in real-life scenarios, the researchers want to experiment with different speech databases containing audio recorded in natural environments. Moreover, the researchers plan to analyze the effect of using different ensemble techniques to achieve higher accuracy rates. The most challenging tasks for future researchers will be, first, finding corpora for different languages recorded in natural environments, as few are readily available, and second, selecting algorithms that perform consistently for all languages in both natural and studio-recorded environments.