Neural candidate-aware language models for speech recognition
Introduction
Language models are one of the essential components in various natural language processing (NLP) tasks such as automatic speech recognition (ASR) and machine translation. Their role is to assign a probability to a sequence of words. N-gram based approaches are the most widely used techniques for language modeling (Goodman, 2001). They calculate the probability under the assumption that the occurrence of a word in the sequence depends only on the occurrence of the preceding words. Although they have been used in many NLP applications, they cannot handle long contexts or word similarities. Neural network language models (NNLMs), including recurrent NNLMs (RNNLMs) (Bengio, Ducharme, Vincent, Janvin, 2003, Schwenk, 2007, Mikolov, Karafiát, Burget, Cernocký, Khudanpur, 2010, Kombrink, Mikolov, Karafiát, Burget, 2011), have been developed to represent a word as a low-dimensional vector in a continuous space. This enables the models to exploit word similarities while calculating word probabilities. In the case of RNNLMs, an RNN can learn longer context information of word sequences than N-gram language models. To further enhance the ability to learn long contexts, long short-term memory RNNLMs (LSTMLMs) (Sundermeyer et al., 2012) have been proposed. More recently, methods built on the Transformer architecture have achieved impressive performance on language modeling tasks (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, 2017, Dai, Yang, Yang, Carbonell, Le, Salakhutdinov, 2019, Al-Rfou, Choe, Constant, Guo, Jones, 2019, Irie, Zeyer, Schlüter, Ney, 2019).
Recent studies on language models in ASR have introduced these NNLMs into first-pass decoding (Huang, Zweig, Dumoulin, 2014, Beck, Zhou, Schlüter, Ney, 2019). In many other studies, however, NNLMs are introduced into two-pass decoding in ASR systems. In the first decoding pass, hypotheses are generated using n-gram language models that are compatible with the decoding algorithms. In the second or later decoding passes, NNLMs, including RNNLMs and Transformer language models, are applied to rescore the hypotheses. Such systems have achieved significant improvements over systems based on n-gram language models alone.
Existing rescoring models calculate scores of the possible candidate sentences, i.e., the hypotheses produced by the speech recognizer. These hypotheses include various ASR errors, and such competing candidates are assumed to be useful for estimating word probabilities during rescoring. However, existing models score each candidate sentence independently, without using the competing candidates or their posterior probabilities.
To overcome this limitation, this paper proposes neural candidate-aware language models (NCALMs), which directly utilize ASR outputs as contexts for estimating the generative probabilities of words. The proposed models are composed of a Transformer encoder and a Transformer decoder. The encoder network embeds the candidates, including hypotheses and their posterior probabilities, into context vectors. The decoder network then computes the generative probability of a word sequence given the context vectors. To find effective hypotheses to embed in the encoder, we present several modeling methods that utilize single and multiple ASR hypotheses obtained by a speech recognizer. As a method for using multiple ASR hypotheses, the encoder embeds a confusion network into a continuous representation (Masumura et al., 2018), as sketched below.
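As a rough illustration of how competing candidates and their posterior probabilities can enter the encoder input, the following sketch embeds each confusion-network slot as the posterior-weighted sum of its candidate word embeddings. This is a minimal, hypothetical PyTorch example: the slot layout, padding convention, and weighting scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ConfusionNetworkEmbedder(nn.Module):
    """Sketch: embed a confusion network slot by slot.

    Each slot holds competing candidate words and their ASR posterior
    probabilities; the slot vector is the posterior-weighted sum of the
    candidate word embeddings (an illustrative simplification).
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, word_ids: torch.Tensor, posteriors: torch.Tensor) -> torch.Tensor:
        # word_ids:   (num_slots, max_candidates) candidate word ids per slot (0 = padding)
        # posteriors: (num_slots, max_candidates) posterior of each candidate (0.0 for padding)
        cand_vecs = self.embed(word_ids)                            # (slots, cands, d_model)
        slot_vecs = (posteriors.unsqueeze(-1) * cand_vecs).sum(1)   # (slots, d_model)
        return slot_vecs  # fed to the Transformer encoder as the context sequence

# Toy usage: two slots with made-up candidate ids and posteriors.
embedder = ConfusionNetworkEmbedder(vocab_size=1000, d_model=16)
ids = torch.tensor([[12, 57, 3], [301, 0, 0]])
post = torch.tensor([[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]])
context = embedder(ids, post)  # shape: (2, 16)
```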
NCALMs belong to the family of conditional neural network language models. Mikolov et al. proposed context-dependent RNNLMs with an auxiliary feature vector generated from latent Dirichlet allocation to enhance context information (Mikolov, Zweig, 2012, Chen, Tan, Liu, Lanchantin, Wan, Gales, Woodland, 2015). A bag-of-words representation has been utilized with the same motivation (Irie et al., 2015). Shi et al. employed part-of-speech tags and conversation-related information (Shi et al., 2012). Our proposed models instead directly utilize embeddings of ASR hypotheses and their posterior probabilities as an additional feature.
NCALMs are also related to discriminative training for language models. The best-known language models with discriminative training are discriminative language models (DLMs) (Chen, Lee, Li, 2000, Xu, Khudanpur, Lehr, Prud’hommeaux, Glenn, Karakos, Roark, Sagae, Saraclar, Shafran, Bikel, Callison-Burch, Cao, Hall, Hasler, Koehn, Lopez, Post, Riley, 2012, Roark, Saraclar, Collins, Johnson, 2004, Oba, Hori, Nakamura, Ito, 2012). DLMs train the model parameters with a discriminative criterion that explicitly considers the relation between ASR hypotheses and their transcriptions. This discriminative criterion has also been applied to optimizing the parameters of RNNLMs (Tachioka, Watanabe, 2015, Hori, Hori, Watanabe, Hershey, 2016). In the inference step, however, these models simply calculate the probability of a word sequence without using any other candidates containing ASR errors. NCALMs utilize ASR hypotheses not only for training but also for inference.
Our proposed models are similar to spelling correction models (Guo, Sainath, Weiss, 2019, Zhang, Lei, Yan, 2019), which generate a sequence that corrects the output of an end-to-end ASR system. In contrast to those studies, our proposed models rescore the N-best list rather than generating a corrected sequence, and they support not only single-sentence input but also confusion network input. In addition, our proposed models do not require complementary training data to produce improvements.
We evaluate NCALMs on ASR tasks with Japanese lectures from the Corpus of Spontaneous Japanese (CSJ) (Furui et al., 2000) and English conversations from the 300-hour Switchboard (SWBD) corpus (Godfrey et al., 1992). The results verify that NCALMs yield better ASR performance than a conventional ASR system with a Transformer language model.
This paper extends our previous work (Tanaka et al., 2018). Following recent studies, we use Transformer encoder-decoder models instead of RNN-based models. When our models encode multiple hypotheses, a confusion network is embedded directly into continuous representations. Furthermore, we conduct additional experimental evaluations that reveal the properties of NCALMs.
This paper is organized as follows. Section 2 explains the Transformer language model that is compared with the proposed models. Section 3 details NCALMs. Experiments are presented in Section 4, and Section 5 concludes the paper and mentions future work.
Section snippets
Transformer language models
This section describes Transformer language models. We use autoregressive models in this paper. Fig. 1 shows an example of a Transformer language model composed of stacked Transformer blocks that have self-attention and feed-forward networks.
Transformer language models estimate generative probabilities of a sequence of words W = (w_1, ..., w_T). In Transformer language models, each word is mapped to a 1-of-K representation and embedded into a continuous representation by an affine transformation as
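For concreteness, a minimal autoregressive sketch is given below. It assumes PyTorch's built-in Transformer encoder layers as the stacked self-attention/feed-forward blocks, a learned positional embedding, and illustrative hyperparameters; it is not the exact configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Minimal autoregressive Transformer language model sketch."""

    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # affine map from 1-of-K to a continuous vector
        self.pos = nn.Embedding(max_len, d_model)       # learned positional embedding (an assumption)
        block = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers)  # stacked self-attention/FFN blocks
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (batch, T) word indices; returns log P(w_t | w_1, ..., w_{t-1}) as (batch, T, vocab)
        T = words.size(1)
        positions = torch.arange(T, device=words.device)
        h = self.embed(words) * math.sqrt(self.embed.embedding_dim) + self.pos(positions)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=words.device), diagonal=1)
        h = self.blocks(h, mask=causal)  # causal mask keeps the model autoregressive
        return self.out(h).log_softmax(dim=-1)
```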
Framework
This section details the definition of candidate-aware language models (CALMs). CALMs are language models that use speech recognizer output as a context for estimating generative probabilities of words. Given speech recognizer output H(x; θ) for input speech x, the generative probability of an input word sequence W is written as P(W | x) ≈ P(W | H(x; θ); Λ), where θ denotes parameters in a speech recognizer and Λ represents parameters of CALMs. The speech recognizer includes acoustic
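To make this formulation concrete, the sketch below computes a score for a candidate sentence W given an encoded ASR hypothesis, summing the per-token log probabilities log P(w_t | w_<t, H(x; θ); Λ). The class name, shapes, and the use of PyTorch's nn.Transformer are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class CandidateAwareScorer(nn.Module):
    """Sketch of a CALM score: encode the ASR hypothesis H(x; theta) as context,
    then decode the sequence W to obtain sum_t log P(w_t | w_<t, H(x; theta); Lambda)."""

    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def log_prob(self, hypothesis: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # hypothesis: (batch, S) word ids of the first-pass ASR output (the context)
        # target:     (batch, T) word ids of the sequence to score, starting with a BOS id
        T = target.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=target.device), diagonal=1)
        h = self.transformer(self.embed(hypothesis), self.embed(target), tgt_mask=causal)
        logp = self.out(h).log_softmax(dim=-1)
        # Position t predicts target[t + 1]; gather and sum the per-token log probabilities.
        token_lp = logp[:, :-1, :].gather(-1, target[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)  # one rescoring score per candidate sentence
```

In an N-best rescoring setting, a score of this kind could be computed for every candidate sentence and, for example, combined with the first-pass ASR score before reranking.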
Task and data
We evaluated the NCALMs on ASR tasks of Japanese lecture speech from CSJ and English conversational speech from the 300-hour SWBD corpus. The datasets are shown in Table 1. The evaluation sets for CSJ were the three standard sets (eval1, eval2, eval3); eval1 and eval2 are composed of actual academic presentations, and eval3 is composed of simulated presentations. The evaluation sets for SWBD were the Switchboard (swbd) and CallHome (callhome) portions of the NIST Hub5 2000 evaluation set.
In our experiments of
Conclusions
In this paper, we proposed NCALMs, which directly utilize ASR outputs as contexts to estimate the generative probabilities of words. We defined and formulated NCALMs as conditional generative models with neural networks. We investigated which hypotheses used for constructing the context in the encoder work best for our proposed models. Single-hypothesis based modeling and confusion network based modeling were compared. In the experiments, the best WER in single-hypothesis based modeling was achieved
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (38)
- Chen et al. (2000). Discriminative training on language model. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Goodman (2001). A bit of progress in language modeling. Comput. Speech Lang.
- Schwenk (2007). Continuous space language models. Comput. Speech Lang.
- Al-Rfou et al. (2019). Character-level language modeling with deeper self-attention. Proceedings of the Conference on Artificial Intelligence (AAAI).
- Ba et al. (2016). Layer normalization. CoRR.
- Beck et al. (2019). LSTM language models for LVCSR in first-pass decoding and lattice-rescoring. CoRR.
- Bengio et al. (2003). A neural probabilistic language model. J. Mach. Learn. Res.
- Chen et al. (2015). Recurrent neural network language model adaptation for multi-genre broadcast speech recognition. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Dai et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. Proceedings of the Conference of the Association for Computational Linguistics (ACL).
- Furui et al. (2000). A Japanese national project on spontaneous speech corpus and processing technology. Proceedings of ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium.
- Gillick and Cox (1989). Some statistical issues in the comparison of speech recognition algorithms. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Godfrey et al. (1992). Switchboard: telephone speech corpus for research and development. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Guo et al. (2019). A spelling correction model for end-to-end speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- He et al. (2016). Deep residual learning for image recognition. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).
- Hori et al. (2016). Minimum word error training of long short-term memory recurrent neural network language models for speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Huang et al. (2014). Cache based recurrent neural network language model inference for first pass speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Irie et al. (2015). Bag-of-words input for long history representation in neural network-based language models for speech recognition. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Irie et al. (2019). Language modeling with deep transformers. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Kingma and Ba (2015). Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR).