A study of continuous space word and sentence representations applied to ASR error detection
Introduction
Recent advances in the field of speech processing have led to significant improvements in speech recognition performance. However, recognition errors remain unavoidable, whatever the quality of the ASR system. This reflects the system's sensitivity to variability, e.g., in acoustic conditions, speaker, language style, etc. These errors may have a considerable impact on applications that rely on automatic transcriptions, such as information retrieval, speech-to-speech translation, and spoken language understanding.
Error detection aims at improving the exploitation of ASR outputs by downstream applications, but it is a difficult task because errors come in several types, ranging from simple mistakes in word morphology, such as number agreement, to the insertion of irrelevant words that affect the overall understanding of the word sequence.
For two decades, many studies have focused on the ASR error detection task. Usually, the best ASR error detection systems are based on Conditional Random Fields (CRF) (Lafferty et al., 2001). In Parada et al. (2010), the authors detect error regions generated by Out-Of-Vocabulary (OOV) words. They propose an approach based on a CRF tagger that takes into account contextual information from neighboring regions instead of considering only the local region of OOV words. A similar approach for other kinds of ASR errors is presented in Béchet and Favre (2013): the authors propose an error detection system based on a CRF tagger using various ASR-derived, lexical, and syntactic features.
Recent approaches leverage neural network classifiers. A neural network trained to locate errors in an utterance using a variety of features is presented in Yik-Cheung et al. (2014). Some of these features are gathered from forward and backward recurrent neural network language models in order to capture long-distance word context within and across previous utterances; the others are extracted from two complementary ASR systems. In Jalalvand and Falavigna, the authors propose a neural network classifier fed by stacked auto-encoders (SAE), which help to learn representations of erroneous words. In Ogawa and Hori (2015, 2017), the authors investigate three types of ASR error detection tasks, namely confidence estimation, out-of-vocabulary word detection, and error type classification (insertion, substitution, or deletion), based on deep bidirectional recurrent neural networks.
In our previous work (Ghannay et al., 2015; 2016a; 2016b), we studied the use of several types of continuous word representations. In Ghannay et al. (2015b), we proposed a neural approach to detect errors in automatic transcriptions and to calibrate confidence measures provided by an ASR system. In addition, we studied different approaches for combining word embeddings in order to benefit from their complementarity. The proposed ASR error detection system integrates several information sources: syntactic, lexical, and ASR-based features, prosodic features, as well as linguistic embeddings.
We also proposed to enrich our ASR error detection system with acoustic information, obtained through acoustic word embeddings. We showed in Ghannay et al. (2016b) that acoustic word embeddings capture additional information about word pronunciation beyond what is conveyed by their spelling, and that they are better than orthographic embeddings at measuring the phonetic proximity between two words. Moreover, using these acoustic embeddings in addition to the other features improved the performance of the proposed ASR error detection system (Ghannay et al., 2016a).
In this paper, we first summarize our previous studies, reporting:
- the performance obtained by the combined linguistic embeddings;
- the approach we used to build acoustic embeddings;
- the evaluation of the combination of linguistic and acoustic embeddings in the framework of the ASR error detection task.
Then, we present new contributions on the combination of prosodic features and acoustic embeddings, and on the use of sentence embeddings to characterize the reliability of entire recognition hypotheses in order to better predict erroneous words. Finally, to show that these results carry over to current state-of-the-art ASR systems, we also present results obtained on the outputs produced by a Kaldi-based TDNN/HMM ASR system (Povey et al., 2011; Peddinti et al., 2015).
The paper is organized as follows. Section 2 presents our ASR error detection system, based on a neural architecture. This system is designed to use word embeddings as part of its input features: different types of word embeddings are used, and each one is examined alone on the ASR error detection task. Section 3 recalls the performance of the simple and combined linguistic embeddings and compares them to a state-of-the-art approach. The approach we used to build the acoustic embeddings, the experimental results concerning their evaluation, and the impact of their combination with prosodic features are reported in Section 4. Then, Section 5 presents the study of modeling recognition errors at the sentence level. Finally, Section 5.5 presents the application of the proposed approach to the outputs produced by a Kaldi-based TDNN/HMM ASR system, just before the conclusion.
ASR error detection system
The proposed error detection system has to attribute the label correct or error to each word in the ASR transcript. Each decision is based on a set of heterogeneous features. In our approach, this classification is performed by analyzing each recognized word within its context. The context window used in this study spans two words on each side of the current word.
This system is based on a feed-forward neural network and it is designed to be fed by different kinds of features, including word
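As an illustrative sketch of the windowing step described above, the two-word context around each recognized word can be extracted as follows (the boundary padding token is a hypothetical choice; the paper only states the window size):

```python
def context_window(words, i, size=2, pad="<pad>"):
    """Return the word at position i together with `size` neighbors on each
    side, padding at utterance boundaries. The "<pad>" token is an
    assumption; the paper only specifies a window of two on each side."""
    padded = [pad] * size + list(words) + [pad] * size
    return padded[i : i + 2 * size + 1]

hyp = ["errors", "are", "still", "unavoidable"]
window = context_window(hyp, 0)
# -> ['<pad>', '<pad>', 'errors', 'are', 'still']
```

Each word's window can then be mapped to embeddings and concatenated with the other features before being fed to the classifier.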
Linguistic word embeddings
Different approaches have been proposed to build linguistic word embeddings through neural networks. These approaches can differ in the type of architecture and the data used to train the model. Hence, they can capture different types of information: semantic, syntactic, etc.
In our previous studies (Ghannay et al., 2015; Ghannay et al., 2016), we evaluated different kinds of word embeddings, including:
- •
Skip-gram: This architecture is
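To make the skip-gram idea concrete, here is a minimal sketch of how its (target, context) training pairs are generated from a word sequence; this is a simplification that ignores negative sampling and frequency subsampling:

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (target, context) pairs a skip-gram model is trained on:
    each word is used to predict the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Training a network to predict the context word from the target word over many such pairs is what yields the embedding vectors.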
Acoustic word embeddings
Until now, we experimented with several information sources: syntactic, lexical, and ASR-based features. However, we had not yet investigated the use of acoustic information. One issue in representing such information is the need for a fixed-size representation that can be injected at the word level in our neural architecture, in the same way as the other information sources. Acoustic word embeddings are an interesting solution for obtaining a fixed-length vector representation. Acoustic word
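The fixed-size constraint can be illustrated with a naive mean-pooling baseline: whatever the word's duration, the result has the same dimension. This is only an illustration of the dimensionality requirement, not the learned acoustic word embeddings evaluated in Section 4:

```python
import numpy as np

def naive_acoustic_vector(frames, dim=13):
    """Collapse a variable-length sequence of frame-level acoustic features
    (shape [n_frames, dim]) into a single fixed-size vector by mean pooling.
    A naive baseline only; the paper's acoustic word embeddings are instead
    learned with a neural network (Section 4)."""
    frames = np.asarray(frames, dtype=float).reshape(-1, dim)
    return frames.mean(axis=0)

# Words of different durations map to vectors of the same dimension:
short_word = naive_acoustic_vector(np.random.rand(12, 13))  # 12 frames
long_word = naive_acoustic_vector(np.random.rand(80, 13))   # 80 frames
assert short_word.shape == long_word.shape == (13,)
```

A learned embedding replaces the pooling with a trained network, but the input/output contract (variable-length frames in, fixed-size vector out) is the same.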
Global decision: sentence embeddings
In this section, we focus on the integration of global information to enrich our ASR error detection system, through the use of sentence embeddings (Sent-Emb). These representations have been successfully used in sentence classification and sentiment analysis tasks (Le and Mikolov, 2014; Lin et al., 2016). Sentence embeddings can be built in a general context by using the tool Doc2vec (Le and Mikolov, 2014), or they can be adapted to a specific task like for
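Independently of Doc2vec, a simple way to see how a fixed-size sentence vector can summarize a whole recognition hypothesis is to average its word embeddings; this is a common baseline, sketched here as an illustration rather than the method evaluated in this section:

```python
import numpy as np

def average_sentence_embedding(words, word_vectors, dim=100):
    """Average the word embeddings of a recognition hypothesis into one
    fixed-size sentence vector. `word_vectors` (word -> np.ndarray) is a
    hypothetical lookup table; out-of-vocabulary words are skipped."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

wv = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
sent = average_sentence_embedding(["the", "cat", "xyz"], wv, dim=2)
# -> array([0.5, 0.5])
```

Doc2vec instead learns the sentence vector jointly with the word vectors, which lets it capture word order information that simple averaging discards.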
Conclusions and future work
This paper presents a study that focuses on the use of different types of continuous representations applied to the ASR error detection task. An important objective in this task is to locate possible linguistic and/or acoustic incongruities in automatic transcriptions. For this, we focused on different types of embeddings able to capture information at different levels: linguistic word embeddings, acoustic word embeddings, and sentence embeddings.
Experiments that were
CRediT authorship contribution statement
Sahar Ghannay: Conceptualization, Methodology, Investigation, Software, Writing - original draft. Yannick Estève: Conceptualization, Investigation, Supervision, Project administration, Resources, Funding acquisition, Writing - review & editing. Nathalie Camelin: Investigation, Methodology, Validation, Resources, Writing - review & editing.
Declaration of Competing Interest
The authors declare that this work has no conflict of interest.
Acknowledgements
This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call. This work was also partially funded by the French National Research Agency (ANR) through the VERA project, under the contract number ANR-12-BS02-006-01.
References (47)
- et al. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.
- et al. (2014). Distributed representations of sentences and documents. In ICML.
- et al. (2016). A topic-enhanced word embedding for Twitter sentiment classification. Inf. Sci.
- et al. (2011). Wsabie: scaling up to large vocabulary image annotation. In IJCAI.
- et al. (2008). Contributions du traitement automatique de la parole à l’étude des voyelles orales du Français. Traitement Autom. Langues.
- et al. (2013). ASR error segment localization for spoken recovery strategy. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- et al. (2014). Word embeddings for speech recognition. In Interspeech.
- et al. (2001). Praat, a system for doing phonetics by computer. Glot Int.
- et al. (2009). Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate? In Interspeech.
- et al. (2010). The EPAC corpus: manual and automatic annotations of conversational speech in French broadcast news. In LREC.
- Integration of word and semantic features for theme identification in telephone conversations. In 6th International Workshop on Spoken Dialog Systems (IWSDS 2015).
- Posterior probability decoding, confidence estimation and system combination. In Proc. Speech Transcription Workshop.
- The ESTER phase II evaluation campaign for the rich transcription of French Broadcast News. In Interspeech.
- The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In Interspeech.
- Transcription de la parole conversationnelle. Traitement Autom. Langues.
- Word embeddings combination and neural networks for robustness in ASR error detection. In European Signal Processing Conference (EUSIPCO 2015).
- Acoustic word embeddings for ASR error detection. In Interspeech 2016.
- Evaluation of acoustic word embeddings. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP.
- Combining continuous word representation and prosodic features for ASR error prediction. In 3rd International Conference on Statistical Language and Speech Processing (SLSP 2015).
- Word embedding evaluation and combination. In 10th Edition of the Language Resources and Evaluation Conference (LREC 2016).
- Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Commun.
- The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12).
- Prosodic and other cues to speech recognition failures. Speech Commun.
Cited by (5)
- Evaluating and Improving Automatic Speech Recognition using Severity. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023.
- ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022.
- Adaptive listening difficulty detection for L2 Learners through moderating ASR resources. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021.
Sahar Ghannay moved to LIMSI (Université Paris-Saclay). Yannick Estève moved to LIA (Avignon Université).