
Speech Communication

Volume 120, June 2020, Pages 31-41

A study of continuous space word and sentence representations applied to ASR error detection

https://doi.org/10.1016/j.specom.2020.03.002

Highlights

  • Experimental results of linguistic, signal, and acoustic word embeddings combined with prosodic features.

  • Experimental results show that acoustic embeddings and prosodic features are highly complementary for error detection.

  • A new task-dedicated sentence embedding helps to improve the ASR error detection quality.

  • The task-dedicated sentence embedding outperforms a well-known generic sentence embedding.

  • A comparison of our feed-forward multi-stream Multi-Layer Perceptron architecture with a bidirectional recurrent neural network.

Abstract

This paper presents a study of continuous word representations applied to the automatic detection of speech recognition errors. A neural network architecture is proposed, which is well suited to handle continuous word representations, like word embeddings. We explore the use of several types of word representations: simple and combined linguistic embeddings, and acoustic ones associated with prosodic features extracted from the audio signal. To compensate for certain phenomena highlighted by the analysis of the average error span, we propose to model the errors at the sentence level through the use of sentence embeddings. An approach to build continuous sentence representations dedicated to ASR error detection is also proposed and compared to the Doc2vec approach. Experiments are performed on automatic transcriptions generated by the LIUM ASR system applied to the French ETAPE corpus. They show that the combination of linguistic embeddings, acoustic embeddings, prosodic features, and sentence embeddings in addition to more classical features yields very competitive results. In particular, these results show the complementarity of acoustic embeddings and prosodic information, and show that the proposed sentence embeddings dedicated to ASR error detection achieve better results than generic sentence embeddings.

Introduction

Recent advances in the field of speech processing have led to significant improvements in speech recognition performance. However, recognition errors are still unavoidable, whatever the quality of the ASR system. This reflects the system's sensitivity to variability, e.g., to acoustic conditions, speaker, language style, etc. These errors may have a considerable impact on applications based on the use of automatic transcriptions, like information retrieval, speech-to-speech translation, spoken language understanding, etc.

Error detection aims at improving the exploitation of ASR outputs by downstream applications, but it is a difficult task because there are several types of errors, ranging from simple mistakes in word morphology, such as number agreement, to insertions of irrelevant words that affect the overall understanding of the word sequence.

For two decades, many studies have focused on the ASR error detection task. Usually, the best ASR error detection systems are based on the use of Conditional Random Fields (CRF) (Lafferty et al., 2001). In Parada et al. (2010), the authors detect error regions generated by Out Of Vocabulary (OOV) words. They propose an approach based on a CRF tagger, which takes into account contextual information from neighboring regions instead of considering only the local region of OOV words. A similar approach for other kinds of ASR errors is presented in Béchet and Favre (2013): the authors propose an error detection system based on a CRF tagger using various ASR-derived, lexical and syntactic features.

Recent approaches leverage neural network classifiers. A neural network trained to locate errors in an utterance using a variety of features is presented in Yik-Cheung et al. (2014). Some of these features are gathered from forward and backward recurrent neural network language models in order to capture long-distance word context within and across previous utterances; the others are extracted from two complementary ASR systems. In Jalalvand and Falavigna, the authors propose a neural network classifier fed by stacked auto-encoders (SAE), which help to learn representations of erroneous words. In Ogawa and Hori (2015, 2017), the authors investigate three ASR error detection tasks, namely confidence estimation, out-of-vocabulary word detection and error type classification (insertion, substitution or deletion), based on deep bidirectional recurrent neural networks.

In our previous research (Ghannay et al., 2015; 2016a; 2016b), we studied the use of several types of continuous word representations. In Ghannay et al. (2015b), we proposed a neural approach to detect errors in automatic transcriptions and to calibrate the confidence measures provided by an ASR system. In addition, we studied different word embedding combination approaches in order to benefit from their complementarity. The proposed ASR error detection system integrates several information sources: syntactic, lexical and ASR-based features, prosodic features, as well as linguistic embeddings.

We also proposed to enrich our ASR error detection system with acoustic information, obtained through acoustic embeddings. In Ghannay et al. (2016b), we showed that acoustic word embeddings capture additional information about word pronunciation compared to the information conveyed by their spelling, and that they are better than orthographic embeddings at measuring the phonetic proximity between two words. Moreover, using these acoustic embeddings in addition to other features improved the performance of the proposed ASR error detection system (Ghannay et al., 2016a).

In this paper, we first summarize our previous studies, reporting:

  • the performance obtained by the combined linguistic embeddings

  • the approach we used to build acoustic embeddings

  • and the evaluation of the combination of linguistic and acoustic embeddings in the framework of the ASR error detection task

Then, we present new contributions on the combination of prosodic features and acoustic embeddings, and on sentence embeddings used to characterize the reliability of entire recognition hypotheses in order to better predict erroneous words. Finally, in order to show that the presented results carry over to current state-of-the-art ASR systems, we also report results obtained on the outputs produced by a Kaldi-based TDNN/HMM ASR system (Povey et al., 2011; Peddinti et al., 2015).

The paper is organized as follows. Section 2 presents our ASR error detection system based on a neural architecture. This system is designed to be used with word embeddings as part of the input features: different types of word embeddings are used and each one is examined alone on the ASR error detection task. Section 3 recalls the performance of the simple and combined linguistic embeddings and compares them to a state-of-the-art approach. The description of the approach we used to build the acoustic embeddings, the experimental results concerning their evaluation, and the impact of their combination with prosodic features are reported in Section 4. Then, Section 5 presents the study of modeling recognition errors at the sentence level. Finally, Section 5.5 presents the application of the proposed approach to the outputs produced by a Kaldi-based TDNN/HMM ASR system, just before the conclusion.

Section snippets

ASR error detection system

The proposed error detection system assigns the label correct or error to each word in the ASR transcript. Each decision is based on a set of heterogeneous features. In our approach, this classification is performed by analyzing each recognized word within its context. The context window size used in this study is two words on each side of the current word.
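As a minimal illustration of this windowing, the following stdlib-only Python sketch builds, for each recognized word, the five-word window fed to the classifier (the function name and padding token are hypothetical, not the authors' implementation):

```python
PAD = "<pad>"  # illustrative boundary token, not from the paper

def context_windows(words, size=2):
    """Return, for each word, its window of `size` neighbors on each side,
    padding at sentence boundaries."""
    padded = [PAD] * size + list(words) + [PAD] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(words))]

hyp = ["the", "cat", "sat", "down"]
windows = context_windows(hyp)
# windows[0] == ["<pad>", "<pad>", "the", "cat", "sat"]
```

Each window would then be mapped to the concatenation of the feature vectors (embeddings and classical features) of its five positions before entering the feed-forward network.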

This system is based on a feed-forward neural network and it is designed to be fed by different kinds of features, including word

Linguistic word embeddings

Different approaches have been proposed to build linguistic word embeddings through neural networks. These approaches can differ in the type of architecture and the data used to train the model. Hence, they can capture different types of information: semantic, syntactic, etc.

In our previous studies (Ghannay et al., 2015; 2016), we evaluated different kinds of word embeddings, including:

  • Skip-gram: This architecture is
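As a reminder of how the skip-gram architecture is trained, the sketch below enumerates the (target, context) pairs that such a model learns to predict, for a toy sentence (an illustrative stdlib-only sketch, not the word2vec implementation):

```python
def skipgram_pairs(words, window=1):
    """Enumerate (target, context) training pairs for a skip-gram model:
    each word predicts its neighbors within `window` positions."""
    pairs = []
    for i, target in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = skipgram_pairs(["a", "b", "c"], window=1)
# [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
```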

Acoustic word embeddings

Until now, we have experimented with several information sources: syntactic, lexical and ASR-based features. However, we had not yet investigated the use of acoustic information. One issue in representing such information is the need for a fixed-size representation that can be injected at the word level in our neural architecture, in the same way as the other information sources. Acoustic word embeddings are an interesting solution to obtain such a fixed-length vector representation. Acoustic word

Global decision: sentence embeddings

In this section, we focus on the integration of global information to enrich our ASR error detection system, through the use of sentence embeddings (Sent-Emb). These representations have been successfully used in sentence classification and sentiment analysis tasks (Le and Mikolov, 2014; Lin et al., 2016). Sentence embeddings can be built in a general context by using the tool Doc2vec (Le and Mikolov, 2014), or they can be adapted to a specific task like for
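To make concrete how a sentence-level vector can enrich word-level decisions, here is a hedged stdlib-only sketch: a generic baseline sentence embedding (the average of the hypothesis's word vectors, not the paper's task-dedicated model) is appended to every word's local feature vector:

```python
def with_sentence_context(word_feats):
    """Append a sentence-level vector (here, simply the average of the
    word vectors) to each word's local feature vector."""
    dim = len(word_feats[0])
    n = len(word_feats)
    sent_emb = [sum(w[d] for w in word_feats) / n for d in range(dim)]
    return [w + sent_emb for w in word_feats]

words = [[1.0, 0.0], [0.0, 1.0]]
enriched = with_sentence_context(words)
# enriched == [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5]]
```

In the paper's setting, the averaged vector would be replaced by a learned, task-dedicated sentence embedding, but the integration pattern (one shared global vector concatenated to every word's features) is the same.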

Conclusions and future work

This paper presents a study that focuses on the use of different types of continuous representations applied to the ASR error detection task. An important objective in this task is to locate possible linguistic and/or acoustic incongruities in automatic transcriptions. To this end, we focused on the use of different types of embeddings that are able to capture information at different levels: linguistic word embeddings, acoustic word embeddings, and sentence embeddings.

Experiments that were

CRediT authorship contribution statement

Sahar Ghannay: Conceptualization, Methodology, Investigation, Software, Writing - original draft. Yannick Estève: Conceptualization, Investigation, Supervision, Project administration, Resources, Funding acquisition, Writing - review & editing. Nathalie Camelin: Investigation, Methodology, Validation, Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgements

This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call. This work was also partially funded by the French National Research Agency (ANR) through the VERA project, under the contract number ANR-12-BS02-006-01.

References (47)

  • Y. Estève et al.

    Integration of word and semantic features for theme identification in telephone conversations

    6th International Workshop on Spoken Dialog Systems (IWSDS 2015)

    (2015)
  • G. Evermann et al.

    Posterior probability decoding, confidence estimation and system combination

    Proc. Speech Transcription Workshop

    (2000)
  • S. Galliano et al.

    The ESTER phase II evaluation campaign for the rich transcription of French Broadcast News

    Interspeech

    (2005)
  • S. Galliano et al.

    The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts

    Interspeech

    (2009)
  • J.-L. Gauvain et al.

    Transcription de la parole conversationnelle

    Traitement Autom. Langues

    (2005)
  • S. Ghannay et al.

    Word embeddings combination and neural networks for robustness in ASR error detection

    European Signal Processing Conference (EUSIPCO 2015)

    (2015)
  • S. Ghannay et al.

    Acoustic word embeddings for ASR error detection

    Interspeech 2016

    (2016)
  • S. Ghannay et al.

    Evaluation of acoustic word embeddings

    Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

    (2016)
  • S. Ghannay et al.

    Combining continuous word representation and prosodic features for ASR error prediction

    3rd International Conference on Statistical Language and Speech Processing (SLSP 2015)

    (2015)
  • S. Ghannay et al.

    Word embedding evaluation and combination

    10th Edition of the Language Resources and Evaluation Conference (LREC 2016)

    (2016)
  • S. Goldwater et al.

    Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates

    Speech Commun.

    (2010)
  • G. Gravier et al.

    The ETAPE corpus for the evaluation of speech-based TV content processing in the French language

    Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12)

    (2012)
  • J. Hirschberg et al.

    Prosodic and other cues to speech recognition failures

    Speech Commun.

    (2004)

    Sahar Ghannay moved to LIMSI (Université Paris-Saclay). Yannick Estève moved to LIA (Avignon Université).
