
Speech Communication

Volume 125, December 2020, Pages 128-141

Acoustic and temporal representations in convolutional neural network models of prosodic events

https://doi.org/10.1016/j.specom.2020.10.005

Highlights

  • Prosodic events are efficiently detected using convolutional neural networks.

  • Analysis of learned representations by predicting candidate features using linear regression.

  • Investigation of latent acoustic, contextual and word duration information.

  • Comparison of representations learned in convolutional and sequential (LSTM) models.

Abstract

Prosodic events such as pitch accents and phrase boundaries have various acoustic and temporal correlates that are used as features in machine learning models to automatically detect these events from speech. These features are often linguistically motivated, high-level features that are hand-crafted by experts to best represent the prosodic events to be detected or classified. An alternative approach is to use a neural network that is trained and optimized to learn suitable feature representations on its own. An open question, however, is what exactly the learned feature representation consists of, since the high-level output of a neural network is not readily interpreted. In this paper, we use a convolutional neural network (CNN) that learns such features from frame-based acoustic input descriptors. We are concerned with the question of what the CNN has learned after being trained on different datasets to perform pitch accent and phrase boundary detection. Specifically, we suggest a methodology for analyzing what temporal, acoustic and context information is latent in the learned feature representation. We use the output representations learned by the CNN to predict various manually computed (aggregated) features using linear regression. The results show that the CNN learns word duration implicitly, and indicate that certain acoustic features may help to locate relevant voiced regions in speech that are useful for detecting pitch accents and phrase boundaries. Finally, our analysis of the latent contextual information learned by the CNN involves a comparison with a sequential model (LSTM) to investigate similarities and differences in what both network types have learned.

Introduction

Prosodic event detection refers to the task of automatically assigning a prosodic event such as a pitch accent or an intonational phrase boundary to syllables or words in transcribed speech data. It is the underlying task in automatic annotation tools (Rosenberg, 2010), can aid linguistic research on large corpora (Schweitzer, 2010) and is a useful component of spoken language technologies (Wightman and Ostendorf, 1994, Shriberg and Stolcke, 2004, Kompe, 1997, Batliner et al., 2001b) due to the connection of prosody to syntax and meaning. Typical approaches to prosodic event detection use machine learning methods that are trained on manually annotated data. The features can range from low-level representations of raw speech or simple frame-based acoustic descriptors, to more linguistically motivated features such as those computed for entire segments, or high-level features that require additional pre-processing and hand-crafting. While the exact choice of representations varies across different approaches, they mainly consist of acoustic features that describe energy and pitch information of the current word or syllable and its surrounding context, as well as temporal features, such as the duration of syllables, words or pauses.

In addition to other supervised machine learning methods such as ensemble learning (Sun, 2002, Schweitzer and Möbius, 2009), neural networks have become a popular approach to detecting prosodic events from speech data (Rosenberg et al., 2015, Wang et al., 2016, Li et al., 2018). The use of such methods is motivated by the notion of letting the model learn the best feature representation on its own. This often produces better results, since manually computed features may not always be the optimal choice for a given task. This is because there may be latent but useful information in the data that is not captured by these features. Neural networks are trained to automatically find suitable representations that lead to better classification decisions.

We have previously proposed a method of using a convolutional neural network (CNN) as an efficient and strong modeling technique to detect pitch accents and intonational phrase boundaries at the word level (Stehwien and Vu, 2017). In contrast to other methods that aim to provide a fine-grained labeling of prosody for linguistic analyses (e.g. pitch accent shapes at the syllable-level, Schweitzer and Möbius, 2009), this method is suitable for application in speech processing pipelines since it is readily implemented1 and requires very little pre-processing. The only segmental information required is time-alignment at the word level. The speech signal is represented by a small set of frame-based acoustic descriptors that can be efficiently extracted using signal processing tools.
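For illustration, the following minimal sketch (in Python, using librosa rather than the openSMILE toolkit cited by the authors, and with descriptor choices that are assumptions for this example) shows how frame-based energy and F0 descriptors of this kind could be extracted from a recording.

import numpy as np
import librosa

def frame_descriptors(wav_path, hop_ms=10):
    # Load the recording and derive the frame hop in samples.
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * hop_ms / 1000)

    # Frame-level RMS energy.
    rms = librosa.feature.rms(y=y, frame_length=4 * hop, hop_length=hop)[0]

    # Frame-level F0 via pYIN; unvoiced frames are NaN and are zeroed out.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, frame_length=2048, hop_length=hop)
    f0 = np.nan_to_num(f0)

    # One row per frame: [energy, F0, voicing flag].
    n = min(len(rms), len(f0))
    return np.stack([rms[:n], f0[:n],
                     voiced_flag[:n].astype(float)], axis=1)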

The CNN takes these comparably “lower-level” acoustic features as input and learns a “higher-level” feature representation that is used to classify words as carrying either a prosodic event or not. The resulting feature representation was learned automatically from the acoustic input features. However, we do not know exactly what information has been learned. Following the increasing popularity and success of neural networks in the field of speech and language processing, the question of what these models are learning is currently attracting more and more interest (notable recent venues are e.g. Interspeech 2018 special session “Deep neural networks: how can we interpret what they have learned?”, and the BlackboxNLP workshop at EMNLP 2018 and ACL 2019).

We have previously introduced a method of analyzing the learned output representations of the CNN (Stehwien et al., 2019) and used this approach to show that the CNN can learn duration information for detecting pitch accents and lexical stress. In that study, we investigated how different methods of max pooling in the CNN, and zero padding of the input, affect how word or syllable duration is learned for various pitch accent and lexical stress detection tasks. The results showed that as long as there is a correlation between duration and the target label (e.g. a pitch accent), the output representation of the CNN can be found to encode duration. We also demonstrated that, while the detection performance is similar, 3-max pooling captures more context information than 1-max pooling, and that the duration of the current word appears to be most relevant for both methods.
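As a rough illustration of these two pooling variants, the following numpy sketch contrasts 1-max pooling (one value per filter over all frames) with 3-max pooling (the three largest activations per filter); keeping the top-k activations in temporal order is a common convention and an assumption here, since the exact implementation is not reproduced in this excerpt.

import numpy as np

def one_max_pool(feature_map):
    # feature_map: (n_filters, n_frames); one value per filter.
    return feature_map.max(axis=1)

def k_max_pool(feature_map, k=3):
    # The k largest activations per filter, kept in temporal order.
    idx = np.sort(np.argsort(feature_map, axis=1)[:, -k:], axis=1)
    return np.take_along_axis(feature_map, idx, axis=1)

# Toy example: 4 filters over 20 frames.
fmap = np.random.rand(4, 20)
h1 = one_max_pool(fmap)     # shape (4,)
h3 = k_max_pool(fmap, k=3)  # shape (4, 3), flattened before the output layer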

In this paper, we address further questions on what information the CNN has learned using the same method. Specifically, we analyze what information is latent in the high-level output representation. We describe experiments on three different corpora and show results for pitch accent and phrase boundary detection. Along with our previous publication, this is the first study to utilize this type of analysis for prosodic event detection. Building on our previous findings, we focus on temporal information, e.g. word duration. We also extend our investigation of the role of context information in the CNN (first discussed in terms of performance in Stehwien and Vu, 2017) to the question of what information from the surrounding acoustic context is most important for the CNN-based prosodic event detector.

Our overall aim in this work is, therefore, to gain insight into what the CNN is learning. This is an important question since even rather standard neural network architectures, such as the ones used in this paper, produce representations that are still not readily interpreted. The fact that the chosen architecture is comparably simple makes it efficient to train and more readily analyzed. Furthermore, while there is a substantial amount of previous work on prosodic event detection that focuses on the role of various cues, such as duration, context and different acoustic features, this work does not introduce new features, but rather discusses their role in a convolutional neural network and addresses the question of what a neural network would learn when given a frame-based representation of the speech signal.

To compare how this method of prosodic event detection performs relative to other approaches, we report not only within-corpus but also cross-corpus results on two English datasets from different speech genres as well as one German corpus. We show that the performance of this method is comparable to, and in some cases outperforms, that of related work.

The main contribution of this paper is organized into three parts: First, we investigate the role of different types of temporal information in the CNN, namely word duration as a feature in the model and the use of pause transcriptions as a pre-processing step. We assume that the CNN can learn much of this information on its own, since word duration and the presence of pauses are implicitly included as part of the frame-based acoustic input. We report quantitative results of experiments comparing the CNN to a model that does not have access to this information: a simple feed-forward neural network (FFNN) with acoustic features that are aggregated across entire words. The results show that the FFNN can benefit from adding temporal information manually, while the CNN does not, which supports this assumption.
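The kind of word-level aggregation such an FFNN baseline relies on can be sketched as follows; the choice of statistics (mean, max, min) and the optional explicit duration feature are illustrative assumptions rather than the exact feature set used in the experiments.

import numpy as np

def aggregate_word(frames, word_start_s, word_end_s, hop_ms=10,
                   add_duration=False):
    # frames: (n_frames, n_feats) frame-based descriptors for one utterance.
    lo = int(word_start_s * 1000 / hop_ms)
    hi = int(word_end_s * 1000 / hop_ms)
    span = frames[lo:hi]
    # Collapse the word span into a fixed-length vector.
    vec = np.concatenate([span.mean(axis=0), span.max(axis=0), span.min(axis=0)])
    if add_duration:
        # Word duration as an explicit, manually added temporal feature.
        vec = np.append(vec, word_end_s - word_start_s)
    return vec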

Next we apply our method of analyzing what the CNN has learned. We conduct a linear regression analysis between the high-level feature representation created by the CNN and a set of aggregated features. We apply this method to estimate how much of the information represented in the aggregated features is latent in, or correlated with, the CNN output. We use this as an indication of what the CNN is learning. The results in this paper provide further evidence that the CNN can learn duration on its own, and that it can learn to locate relevant voiced regions in speech for detecting pitch accents and phrase boundaries.
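A minimal sketch of this probing step, assuming scikit-learn and a held-out split (the exact regression setup is not specified in this excerpt): a linear regression is fit from the CNN's word-level output representation to a manually computed candidate feature, and the R² score is read as a rough measure of how much of that feature is latent in the representation.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def probe_feature(cnn_reprs, candidate_feature, test_size=0.2, seed=0):
    # cnn_reprs: (n_words, repr_dim); candidate_feature: (n_words,).
    X_tr, X_te, y_tr, y_te = train_test_split(
        cnn_reprs, candidate_feature, test_size=test_size, random_state=seed)
    reg = LinearRegression().fit(X_tr, y_tr)
    return reg.score(X_te, y_te)  # held-out R^2

# Example: is word duration recoverable from the learned representation?
# r2_duration = probe_feature(cnn_output, word_durations)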

In our final analysis, we take a closer look at the amount of acoustic and temporal context information that is learned by the CNN and compare it to a long short-term memory network (LSTM), which, as a sequential model, is specifically designed to take context into account (which sets it apart from a CNN). We found that while there are differences in how much context information is learned by the different network types, the most important acoustic and temporal features in both networks pertain to the current word.

This paper is organized as follows: In the next section, we provide an overview of related work. We describe the CNN model and discuss its performance in Section 3. In Section 4, we investigate the role of temporal information in the CNN-based prosodic event detector. Our method of analyzing the CNN output representations and our results are described in Section 5. Section 6 contains the comparative analysis of the CNN and LSTM. We discuss the results of the analysis in Section 7. Section 8 concludes the paper.

Section snippets

Related work

An early example of using neural networks to locate pitch accents and phrase boundaries using frame-based features was presented by Taylor (1995). More recently, Rosenberg et al. (2015) used recurrent neural networks (RNN) for prominence detection. They used several pre-computed features that were aggregated over entire words as input to an RNN that processes and labels entire word sequences. Li et al. (2018) used multi-distribution neural networks for lexical stress and pitch accent detection.

Data

We used three different speech corpora in these experiments: two American English corpora and one German corpus.

The first English corpus is the Boston University Radio News Corpus (BURNC) (Ostendorf et al., 1995). It is one of the most widely used corpora for research on automatic prosodic labeling. This corpus consists of recordings of radio news broadcasts, read by professional speakers. The recordings were orthographically transcribed, automatically force-aligned and manually corrected.

The role of temporal information

As a first analysis using quantitative measures, we investigated how temporal information, namely word duration and the presence of pauses, affects the performance of the CNN.

Apart from acoustic cues, pitch accents and phrase boundaries have temporal correlates: Pitch accents often occur on content words, which tend to be longer, and prominence itself can cause lengthening of certain segments. Phrase boundaries often occur before pauses and breaks, and cause pre-boundary lengthening.
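For illustration, such temporal correlates can be read directly off the word-level time alignments; the following sketch (a hypothetical helper, not the code used in the experiments) computes each word's duration and the pause following it.

def temporal_features(words):
    # words: list of (label, start_s, end_s) tuples, ordered in time.
    feats = []
    for i, (label, start, end) in enumerate(words):
        if i + 1 < len(words):
            pause_after = max(0.0, words[i + 1][1] - end)
        else:
            pause_after = 0.0  # last word in the utterance
        feats.append({"word": label,
                      "duration": end - start,
                      "pause_after": pause_after})
    return feats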

Analysis of acoustic feature representations

In this section, we describe our method of analyzing the latent information encoded in the high-level representations learned by the CNN. We have previously described this method for finding evidence of duration information in the neural network output (Stehwien et al., 2019). In this paper, we take a closer look at the acoustic representations encoded in the CNN output and also include a “reverse” analysis, which we describe in the following.

The role of context information: convolutional vs. sequential models

In previous work, we discussed the role of context information in combination with the position indicators in terms of performance (Stehwien and Vu, 2017) and in terms of how much context word duration is learned using different pooling methods (Stehwien et al., 2019). We found that the method of padding the input matrix and then performing 1-max pooling makes it necessary to add position indicators that mark which frames pertain to the context or current word.
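A sketch of this input construction, assuming a single binary indicator column (the exact encoding of the position indicators may differ): frames of the left context, the current word and the right context are stacked, an indicator column marks the current-word frames, and the matrix is zero-padded to a fixed length before 1-max pooling.

import numpy as np

def build_input_matrix(frames, left_span, cur_span, right_span, max_frames):
    # frames: (n_frames, n_feats); spans are (start_frame, end_frame) pairs.
    parts, flags = [], []
    for (lo, hi), flag in [(left_span, 0.0), (cur_span, 1.0), (right_span, 0.0)]:
        parts.append(frames[lo:hi])
        flags.append(np.full(hi - lo, flag))
    mat = np.concatenate(parts, axis=0)
    ind = np.concatenate(flags)[:, None]        # position indicator column
    mat = np.concatenate([mat, ind], axis=1)

    # Zero-pad so that 1-max pooling sees equal-sized inputs.
    pad = max_frames - mat.shape[0]
    if pad > 0:
        mat = np.concatenate([mat, np.zeros((pad, mat.shape[1]))], axis=0)
    return mat[:max_frames]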

Discussion

The method of analyzing the output of a neural network using linear regression provides a rough estimation of what information is latent in the CNN output features. It can be used to test hypotheses by measuring correlations between manually defined candidate features and the learned output. We believe that the fact that many results were in line with our expectations also supports our position that this is a suitable and readily implemented method of analyzing what a neural network has learned.

Conclusion

In this paper, we investigated what information is encoded in the learned feature representations of a convolutional neural network (CNN) that is trained to detect prosodic events. This method yields good results on English and German data using frame-based acoustic descriptors as features. The primary focus of this paper was the role of the different acoustic features and of temporal information in the CNN. We compared the CNN to a feed-forward neural network (FFNN) as a model that does not have access to this information implicitly.

CRediT authorship contribution statement

Sabrina Stehwien: Conceptualization, Methodology, Investigation, Software, Visualization, Writing - original draft, Writing - review & editing. Antje Schweitzer: Supervision, Methodology, Writing - review & editing. Ngoc Thang Vu: Conceptualization, Methodology, Supervision, Resources, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments. This work was funded by the Sonderforschungsbereich (collaborative research center) SFB 732, project A8, of the German Research Foundation (DFG).

References (39)

  • Li, Kun, et al.

    Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks

    Speech Commun.

    (2018)
  • Adi, Yossi, et al.

    Analysis of sentence embedding models using prediction tasks in natural language processing

    IBM J. Res. Dev.

    (2017)
  • Batliner, Anton, Buckow, Jan, Huber, Richard, Warnke, Volker, Nöth, Elmar, Niemann, Heinrich, 2001a. Boiling down...
  • Batliner, Anton, et al.

    Prosodic models, automatic speech understanding, and speech synthesis: Towards the common ground

  • Batliner, Anton, Nöth, Elmar, Buckow, Jan, Huber, Richard, Warnke, Volker, Niemann, Heinrich, 2001c. Duration features...
  • Chollet, François

    Keras

    (2015)
  • Eckart, Kerstin, et al.

    A discourse information radio news database for linguistic analysis

  • Eyben, Florian, Weninger, Felix, Groß, Florian, Schuller, Björn, 2013. Recent developments in openSMILE, the Munich...
  • Hirschberg, Julia, Nakatani, Christine H., 1996. A prosodic analysis of discourse segments in direction-giving...
  • Kakouros, S., Suni, A., Simko, J., Vainio, M., 2019. Prosodic representations of prominence classification neural...
  • Kingma, Diederik P., et al.

    Adam: A method for stochastic optimization

    (2017)
  • Kompe, Ralf

    Prosody in Speech Understanding Systems

    (1997)
  • Kumar, Vivek, et al.

    Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2008)
  • Levow, Gina-Anne, 2005. Context in multi-lingual tone and pitch accent recognition. In: Proceedings of Interspeech, pp....
  • Mayer, Jörg

    Transcription of German intonation. The Stuttgart system

    (1995)
  • Nenkova, A., Brenier, J., Kothari, A., Calhoun, S., Whitton, L., Beaver, D., Jurafsky, D., 2007. To memorize or to...
  • Ostendorf, Mari, et al.

    The Boston University Radio News Corpus. Technical Report ECS-95-001

    (1995)
  • R Core Team

    R: A Language and Environment for Statistical Computing

    (2013)
  • Ren, K., Kim, S. -S., Hasegawa-Johnson, M., Cole, J., 2004. Speaker-independent automatic detection of pitch accent....