Acoustic and temporal representations in convolutional neural network models of prosodic events
Introduction
Prosodic event detection refers to the task of automatically assigning a prosodic event such as a pitch accent or an intonational phrase boundary to syllables or words in transcribed speech data. It is the underlying task in automatic annotation tools (Rosenberg, 2010), can aid linguistic research on large corpora (Schweitzer, 2010) and is a useful component of spoken language technologies (Wightman and Ostendorf, 1994, Shriberg and Stolcke, 2004, Kompe, 1997, Batliner et al., 2001b) due to the connection of prosody to syntax and meaning. Typical approaches to prosodic event detection use machine learning methods that are trained on manually annotated data. The features can range from low-level representations of raw speech or simple frame-based acoustic descriptors, to more linguistically motivated features such as those computed for entire segments, or high-level features that require additional pre-processing and hand-crafting. While the exact choice of representations varies across different approaches, they mainly consist of acoustic features that describe energy and pitch information of the current word or syllable and its surrounding context, as well as temporal features, such as the duration of syllables, words or pauses.
In addition to other supervised machine learning methods such as ensemble learning (Sun, 2002, Schweitzer and Möbius, 2009), neural networks have become a popular approach to detecting prosodic events from speech data (Rosenberg et al., 2015, Wang et al., 2016, Li et al., 2018). The use of such methods is motivated by the notion of letting the model learn the best feature representation on its own. This often produces better results, since manually computed features are not always optimal for a given task: there may be latent but useful information in the data that such features do not capture. Neural networks are trained to automatically find suitable representations that lead to better classification decisions.
We have previously proposed a method of using a convolutional neural network (CNN) as an efficient and strong modeling technique to detect pitch accents and intonational phrase boundaries at the word level (Stehwien and Vu, 2017). In contrast to other methods that aim to provide a fine-grained labeling of prosody for linguistic analyses (e.g. pitch accent shapes at the syllable level, Schweitzer and Möbius, 2009), this method is suitable for application in speech processing pipelines since it is readily implemented and requires very little pre-processing. The only segmental information required is time-alignment at the word level. The speech signal is represented by a small set of frame-based acoustic descriptors that can be efficiently extracted using signal processing tools.
The CNN takes these comparatively “lower-level” acoustic features as input and learns a “higher-level” feature representation that is used to classify words as carrying either a prosodic event or not. This representation is learned automatically from the acoustic input features; however, we do not know exactly what information has been learned. Following the increasing popularity and success of neural networks in the field of speech and language processing, the question of what these models are learning is attracting more and more interest (notable recent venues include the Interspeech 2018 special session “Deep neural networks: how can we interpret what they have learned?”, and the BlackboxNLP workshops at EMNLP 2018 and ACL 2019).
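As a rough illustration of this pipeline, the following sketch in plain NumPy convolves kernels over a word's frame matrix, applies 1-max pooling, and classifies with a softmax layer. The layer sizes, kernel width, and descriptor count are illustrative assumptions, not the configuration used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_over_frames(frames, kernels):
    """Slide each kernel over the time axis; kernels span all descriptors."""
    n_frames, n_feats = frames.shape
    k_len = kernels.shape[1]
    out = np.empty((kernels.shape[0], n_frames - k_len + 1))
    for i, k in enumerate(kernels):
        for t in range(n_frames - k_len + 1):
            # ReLU activation on the dot product of kernel and frame window
            out[i, t] = np.maximum(np.sum(frames[t:t + k_len] * k), 0.0)
    return out

def one_max_pool(feature_maps):
    """1-max pooling: keep only the strongest activation per feature map."""
    return feature_maps.max(axis=1)

# Toy word: 50 frames x 6 acoustic descriptors (e.g. energy, F0, ...),
# 8 convolution kernels of width 5 frames.
frames = rng.standard_normal((50, 6))
kernels = rng.standard_normal((8, 5, 6))
pooled = one_max_pool(conv1d_over_frames(frames, kernels))  # shape (8,)

# A softmax output layer classifies the word as accented / not accented.
w = rng.standard_normal((8, 2))
logits = pooled @ w
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Note that the length of the input frame matrix itself varies with word duration, which is one way temporal information enters the model implicitly.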
We have previously introduced a method of analyzing the learned output representations of the CNN (Stehwien et al., 2019) and used this approach to show that the CNN can learn duration information for detecting pitch accents and lexical stress. In that study, we investigated how different methods of max pooling in the CNN, and zero padding of the input, affect how word or syllable duration is learned for various pitch accent and lexical stress detection tasks. The results showed that as long as there is a correlation between duration and the target label (e.g. a pitch accent), the output representation of the CNN can be found to encode duration. We also demonstrated that, while detection performance is similar, 3-max pooling captures more context information than 1-max pooling, and that the duration of the current word appears to be most relevant for both methods.
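The 3-max pooling variant can be sketched as follows; this is a minimal NumPy version under the assumption that feature maps are stored as rows. Keeping the top activations in temporal order preserves a coarse notion of where in the word they occurred:

```python
import numpy as np

def k_max_pool(feature_maps, k=3):
    """Keep the k largest activations per feature map, in temporal order."""
    idx = np.argsort(feature_maps, axis=1)[:, -k:]  # indices of the top-k values
    idx.sort(axis=1)                                # restore temporal order
    return np.take_along_axis(feature_maps, idx, axis=1)

# One feature map over 5 time steps: the top-3 activations are kept
# in the order they occurred, not sorted by magnitude.
fm = np.array([[1., 5., 2., 4., 3.]])
top3 = k_max_pool(fm, k=3)  # -> [[5., 4., 3.]]
```

With k=1 this reduces to standard 1-max pooling, which discards all positional information except the single strongest response.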
In this paper, we address further questions on what information the CNN has learned using the same method. Specifically, we analyze what information is latent in the high-level output representation. We describe experiments on three different corpora and show results for pitch accent and phrase boundary detection. Along with our previous publication, this is the first study to utilize this type of analysis for prosodic event detection. Building on our previous findings, we focus on temporal information, e.g. word duration. We also extend our investigation of the role of context information in the CNN (first discussed in terms of performance in Stehwien and Vu, 2017) to the question of what information from the surrounding acoustic context is most important for the CNN-based prosodic event detector.
Our overall aim in this work is, therefore, to gain insight into what the CNN is learning. This is an important question, since even rather standard neural network architectures, such as the ones used in this paper, produce representations that are still not readily interpreted. The fact that the chosen architecture is comparatively simple makes it efficient to train and easier to analyze. Furthermore, while there is a substantial amount of previous work on prosodic event detection that focuses on the role of various cues, such as duration, context and different acoustic features, this work does not introduce new features; rather, it discusses their role in a convolutional neural network and addresses the question of what a neural network learns when given a frame-based representation of the speech signal.
To compare how this method of prosodic event detection performs relative to other methods, we report not only within-corpus but also cross-corpus results on two English datasets from different speech genres as well as one German corpus. We show that the performance of this method is comparable to, and in some cases outperforms, that of related work.
The main contribution of this paper is organized into three parts: First, we investigate the role of different types of temporal information in the CNN, namely word duration as a feature in the model and the use of pause transcriptions as a pre-processing step. We assume that the CNN can learn much of this information on its own, since word duration and the presence of pauses are implicitly included in the frame-based acoustic input. We report quantitative results of experiments comparing the CNN to a model that does not have access to this information: a simple feed-forward neural network (FFNN) with acoustic features that are aggregated across entire words. The results show that the FFNN benefits from adding temporal information manually, while the CNN does not, which supports this assumption.
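The contrast between the two input types can be made concrete with a small sketch (the function name and the choice of mean/max statistics are illustrative assumptions): the FFNN only sees fixed-size per-word aggregates, so temporal information such as duration must be appended explicitly, whereas the CNN receives the full frame matrix, whose length itself reflects duration:

```python
import numpy as np

def aggregate_word(frames, duration=None):
    """Collapse a (n_frames, n_feats) matrix into one fixed-size vector.

    The aggregation discards frame-level timing, so word duration is lost
    unless it is appended as an explicit feature.
    """
    stats = np.concatenate([frames.mean(axis=0), frames.max(axis=0)])
    if duration is not None:  # temporal feature added manually
        stats = np.append(stats, duration)
    return stats

# A 40-frame word with 6 descriptors yields the same 12-dimensional vector
# length as a 10-frame word would; adding duration makes it 13-dimensional.
word_frames = np.random.default_rng(0).standard_normal((40, 6))
ffnn_input = aggregate_word(word_frames, duration=0.31)
```

This is why, as reported below, the FFNN benefits from manually added temporal features while the CNN does not.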
Next we apply our method of analyzing what the CNN has learned. We conduct a linear regression analysis between the high-level feature representation created by the CNN and a set of aggregated features. We apply this method to estimate how much of the information represented in the aggregated features is latent in, or correlated with, the CNN output. We use this as an indication of what the CNN is learning. The results in this paper provide further evidence that the CNN can learn duration on its own, and that it can learn to locate relevant voiced regions in speech for detecting pitch accents and phrase boundaries.
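A minimal version of this probing analysis might look as follows; all names and sizes are illustrative, and a synthetic target correlated with the representation stands in for an aggregated feature such as word duration. The R² of a linear fit from the CNN output to the candidate feature indicates how much of that feature is latent in the representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# H: CNN output representations, one row per word (sizes are illustrative).
n_words, dim = 200, 16
H = rng.standard_normal((n_words, dim))

# Synthetic stand-in for an aggregated feature (e.g. word duration) that is
# largely linearly recoverable from H, plus a small amount of noise.
true_w = rng.standard_normal(dim)
duration = H @ true_w + 0.1 * rng.standard_normal(n_words)

# Fit a linear regression H -> duration (with intercept) by least squares.
X = np.hstack([H, np.ones((n_words, 1))])
coef, *_ = np.linalg.lstsq(X, duration, rcond=None)
pred = X @ coef

# R^2: the share of the feature's variance explained by the representation.
r2 = 1 - np.sum((duration - pred) ** 2) / np.sum((duration - duration.mean()) ** 2)
```

A high R² for a candidate feature is taken as evidence that the information it describes is encoded in, or at least correlated with, the learned representation.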
In our final analysis, we take a closer look at the amount of acoustic and temporal context information that is learned by the CNN and compare it to a long short-term memory network (LSTM), which, as a sequential model, is specifically designed to take context into account (which sets it apart from a CNN). We found that while there are differences in how much context information is learned by the different network types, the most important acoustic and temporal features in both networks pertain to the current word.
This paper is organized as follows: In the next section, we provide an overview of related work. We describe the CNN model and discuss its performance in Section 3. In Section 4, we investigate the role of temporal information in the CNN-based prosodic event detector. Our method of analyzing the CNN output representations and our results are described in Section 5. Section 6 contains the comparative analysis of the CNN and LSTM. We discuss the results of the analysis in Section 7. Section 8 concludes the paper.
Related work
An early example of using neural networks to locate pitch accents and phrase boundaries using frame-based features was presented by Taylor (1995). More recently, Rosenberg et al. (2015) used recurrent neural networks (RNN) for prominence detection. They used several pre-computed features that were aggregated over entire words as input to an RNN that processes and labels entire word sequences. Li et al. (2018) used multi-distribution neural networks for lexical stress and pitch accent detection
Data
We used three different speech corpora in these experiments: two American English corpora and one German corpus.
The first English corpus is the Boston University Radio News Corpus (BURNC) (Ostendorf et al., 1995). It is one of the most widely used corpora for research on automatic prosodic labeling. This corpus consists of recordings of radio news broadcasts, read by professional speakers. The recordings were orthographically transcribed, automatically force-aligned and manually corrected. In
The role of temporal information
As a first analysis using quantitative measures, we investigated how temporal information, namely word duration and the presence of pauses, affects the performance of the CNN.
Apart from acoustic cues, pitch accents and phrase boundaries have temporal correlates: Pitch accents often occur on content words, which tend to be longer, and prominence itself can cause lengthening of certain segments. Phrase boundaries often occur before pauses and breaks, and cause pre-boundary lengthening at the
Analysis of acoustic feature representations
In this section, we describe our method of analyzing the latent information encoded in the high-level representations learned by the CNN. We have previously described this method for finding evidence of duration information in the neural network output (Stehwien et al., 2019). In this paper, we take a closer look at the acoustic representations encoded in the CNN output and also include a “reverse” analysis, which we describe in the following.
The role of context information: convolutional vs. sequential models
In previous work, we discussed the role of context information in combination with the position indicators in terms of performance (Stehwien and Vu, 2017) and in terms of how much context word duration is learned using different pooling methods in Stehwien et al. (2019). We found that the method of padding the input matrix and then performing 1-max pooling makes it necessary to add position indicators that mark which frames pertain to the context or current word, and that therefore, the CNN
Discussion
The method of analyzing the output of a neural network using linear regression provides a rough estimation about what information is latent in the CNN output features. It can be used to test hypotheses by measuring correlations between manually defined candidate features and the learned output. We believe that the fact that many results were in line with our expectations also supports our position that this is a suitable and readily implemented method of analyzing what a neural network has
Conclusion
In this paper, we investigated what information is encoded in the learned feature representations of a convolutional neural network (CNN) that is trained to detect prosodic events. This method yields good results on English and German data using frame-based acoustic descriptors as features. The primary focus of this paper was the role of the different acoustic features and of temporal information in the CNN. We compared the CNN to a feed-forward neural network (FFNN) as a model that does not
CRediT authorship contribution statement
Sabrina Stehwien: Conceptualization, Methodology, Investigation, Software, Visualization, Writing - original draft, Writing - review & editing. Antje Schweitzer: Supervision, Methodology, Writing - review & editing. Ngoc Thang Vu: Conceptualization, Methodology, Supervision, Resources, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the anonymous reviewers for their helpful comments. This work was funded by the Sonderforschungsbereich (collaborative research center) SFB 732, project A8, of the German Research Foundation (DFG).
References
- Li et al., 2018. Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks. Speech Commun.
- et al., 2017. Analysis of sentence embedding models using prediction tasks in natural language processing. IBM J. Res. Dev.
- Batliner, Anton, Buckow, Jan, Huber, Richard, Warnke, Volker, Nöth, Elmar, Niemann, Heinrich, 2001a. Boiling down...
- et al. Prosodic models, automatic speech understanding, and speech synthesis: Towards the common ground.
- Batliner, Anton, Nöth, Elmar, Buckow, Jan, Huber, Richard, Warnke, Volker, Niemann, Heinrich, 2001c. Duration features...
- Keras, 2015.
- et al. A discourse information radio news database for linguistic analysis.
- Eyben, Florian, Weninger, Felix, Groß, Florian, Schuller, Björn, 2013. Recent developments in openSMILE, the Munich...
- Hirschberg, Julia, Nakatani, Christine H., 1996. A prosodic analysis of discourse segments in direction-giving...
- Kakouros, S., Suni, A., Simko, J., Vainio, M., 2019. Prosodic representations of prominence classification neural...