Neural candidate-aware language models for speech recognition
Introduction
Language models are one of the essential components in various natural language processing (NLP) tasks such as automatic speech recognition (ASR) and machine translation. Their role is to assign a probability to a sequence of words. N-gram based approaches are the most widely used techniques for language modeling (Goodman, 2001). They calculate the probability under the assumption that the occurrence of a word in the sequence depends only on the occurrence of the preceding words. Although they have been used in many NLP applications, they cannot handle long contexts or word similarities. Neural network language models (NNLMs), including recurrent NNLMs (RNNLMs) (Bengio, Ducharme, Vincent, Janvin, 2003, Schwenk, 2007, Mikolov, Karafiát, Burget, Cernocký, Khudanpur, 2010, Kombrink, Mikolov, Karafiát, Burget, 2011), have been developed to represent a word as a low-dimensional vector in a continuous space. This enables the models to exploit word similarities while calculating word probabilities. In the case of RNNLMs, an RNN can learn longer context information of word sequences than N-gram language models. To further enhance the ability to learn long contexts, long short-term memory RNNLMs (LSTMLMs) (Sundermeyer et al., 2012) have been proposed. More recently, methods built on the Transformer architecture have achieved impressive performance on language modeling tasks (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, 2017, Dai, Yang, Yang, Carbonell, Le, Salakhutdinov, 2019, Al-Rfou, Choe, Constant, Guo, Jones, 2019, Irie, Zeyer, Schlüter, Ney, 2019).
Recent studies on language models in ASR have introduced these NNLMs into first-pass decoding (Huang, Zweig, Dumoulin, 2014, Beck, Zhou, Schlüter, Ney, 2019). In many other studies, however, NNLMs are introduced into two-pass decoding in ASR systems. In the first decoding pass, hypotheses are generated using n-gram language models that are compatible with the decoding algorithms. In the second or later decoding passes, NNLMs, including RNNLMs and Transformer language models, are applied to rescore the hypotheses. Such systems have achieved significant improvements over systems based on n-gram language models alone.
Existing rescoring models calculate scores of the possible candidate sentences, i.e., the hypotheses produced by the speech recognizer. These hypotheses include various ASR errors, and such competing candidates are assumed to be useful for estimating word probabilities during rescoring. However, existing models score each candidate sentence independently, without using the competing candidates or their posterior probabilities.
To overcome this limitation, this paper proposes neural candidate-aware language models (NCALMs), which directly utilize ASR outputs as contexts for estimating the generative probabilities of words. The proposed models are composed of a Transformer encoder and a Transformer decoder. The encoder network embeds the candidates, including hypotheses and their posterior probabilities, into context vectors. The decoder network then computes the generative probability of a word sequence given the context vectors. To find effective hypotheses to embed in the encoder, we present several modeling methods that utilize single and multiple ASR hypotheses obtained by a speech recognizer. As a method for using multiple ASR hypotheses, the encoder embeds a confusion network into a continuous representation (Masumura et al., 2018), as sketched below.
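As a rough illustration of how competing candidates and their posterior probabilities can enter the encoder input, the following sketch embeds each confusion-network slot as the posterior-weighted sum of its candidate word embeddings. This is a minimal, hypothetical PyTorch example: the slot layout, padding convention, and weighting scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ConfusionNetworkEmbedder(nn.Module):
    """Sketch: embed a confusion network slot by slot.

    Each slot holds competing candidate words and their ASR posterior
    probabilities; the slot vector is the posterior-weighted sum of the
    candidate word embeddings (an illustrative simplification).
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, word_ids: torch.Tensor, posteriors: torch.Tensor) -> torch.Tensor:
        # word_ids:   (num_slots, max_candidates) candidate word ids per slot (0 = padding)
        # posteriors: (num_slots, max_candidates) posterior of each candidate (0.0 for padding)
        cand_vecs = self.embed(word_ids)                            # (slots, cands, d_model)
        slot_vecs = (posteriors.unsqueeze(-1) * cand_vecs).sum(1)   # (slots, d_model)
        return slot_vecs  # fed to the Transformer encoder as the context sequence

# Toy usage: two slots with made-up candidate ids and posteriors.
embedder = ConfusionNetworkEmbedder(vocab_size=1000, d_model=16)
ids = torch.tensor([[12, 57, 3], [301, 0, 0]])
post = torch.tensor([[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]])
context = embedder(ids, post)  # shape: (2, 16)
```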
NCALMs belong to the family of conditional neural network language models. Mikolov et al. proposed context-dependent RNNLMs with an auxiliary feature vector generated from latent Dirichlet allocation to enhance context information (Mikolov, Zweig, 2012, Chen, Tan, Liu, Lanchantin, Wan, Gales, Woodland, 2015). A bag-of-words representation has been utilized with the same motivation (Irie et al., 2015). Shi et al. employed part-of-speech tags and conversation-related information (Shi et al., 2012). Our proposed models instead directly utilize embeddings of ASR hypotheses and their posterior probabilities as an additional feature.
NCALMs are also related to discriminative training for language models. The best-known language models with discriminative training are discriminative language models (DLMs) (Chen, Lee, Li, 2000, Xu, Khudanpur, Lehr, Prud’hommeaux, Glenn, Karakos, Roark, Sagae, Saraclar, Shafran, Bikel, Callison-Burch, Cao, Hall, Hasler, Koehn, Lopez, Post, Riley, 2012, Roark, Saraclar, Collins, Johnson, 2004, Oba, Hori, Nakamura, Ito, 2012). DLMs train the model parameters with a discriminative criterion that explicitly considers the relation between ASR hypotheses and their transcriptions. This discriminative criterion has also been applied to optimizing the parameters of RNNLMs (Tachioka, Watanabe, 2015, Hori, Hori, Watanabe, Hershey, 2016). In the inference step, however, these models simply calculate the probability of a word sequence without using any other candidates containing ASR errors. NCALMs utilize ASR hypotheses not only for training but also for inference.
Our proposed models are similar to spelling correction models (Guo, Sainath, Weiss, 2019, Zhang, Lei, Yan, 2019), which generate a sequence that corrects the output of an end-to-end ASR system. In contrast to those studies, our proposed models rescore the N-best list rather than generating a corrected sequence, and they support not only single-sentence input but also confusion network input. In addition, our proposed models do not require complementary training data to produce improvements.
We evaluate NCALMs on ASR tasks with Japanese lectures from the Corpus of Spontaneous Japanese (CSJ) (Furui et al., 2000) and English conversations from the 300-hour Switchboard (SWBD) corpus (Godfrey et al., 1992). The results verify that NCALMs yield better ASR performance than a conventional ASR system with a Transformer language model.
This paper extends our previous work (Tanaka et al., 2018). Following recent studies, we use Transformer encoder-decoder models instead of RNN-based models. When our models encode multiple hypotheses, a confusion network is embedded directly into continuous representations. Furthermore, we conduct additional experimental evaluations that reveal the properties of NCALMs.
This paper is organized as follows. Section 2 explains the Transformer language model that is compared with the proposed models. Section 3 details NCALMs. Experiments are presented in Section 4, and Section 5 concludes the paper and mentions future work.
Section snippets
Transformer language models
This section describes Transformer language models. We use autoregressive models in this paper. Fig. 1 shows an example of a Transformer language model composed of stacked Transformer blocks that have self-attention and feed-forward networks.
Transformer language models estimate generative probabilities of a sequence of words W = (w_1, ..., w_T). In Transformer language models, each word is mapped to a 1-of-K representation and embedded into a continuous representation by an affine transformation as
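For concreteness, a minimal autoregressive sketch is given below. It assumes PyTorch's built-in Transformer encoder layers as the stacked self-attention/feed-forward blocks, a learned positional embedding, and illustrative hyperparameters; it is not the exact configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Minimal autoregressive Transformer language model sketch."""

    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # affine map from 1-of-K to a continuous vector
        self.pos = nn.Embedding(max_len, d_model)       # learned positional embedding (an assumption)
        block = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers)  # stacked self-attention/FFN blocks
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (batch, T) word indices; returns log P(w_t | w_1, ..., w_{t-1}) as (batch, T, vocab)
        T = words.size(1)
        positions = torch.arange(T, device=words.device)
        h = self.embed(words) * math.sqrt(self.embed.embedding_dim) + self.pos(positions)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=words.device), diagonal=1)
        h = self.blocks(h, mask=causal)  # causal mask keeps the model autoregressive
        return self.out(h).log_softmax(dim=-1)
```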
Framework
This section details the definition of candidate-aware language models (CALMs). CALMs are language models that use speech recognizer output as a context for estimating generative probabilities of words. Given speech recognizer output H(x; θ) for input speech x, the generative probability of an input word sequence W is written as P(W | x) ≈ P(W | H(x; θ); Λ), where θ denotes parameters in a speech recognizer and Λ represents parameters of CALMs. The speech recognizer includes acoustic
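To make this formulation concrete, the sketch below computes a score for a candidate sentence W given an encoded ASR hypothesis, summing the per-token log probabilities log P(w_t | w_<t, H(x; θ); Λ). The class name, shapes, and the use of PyTorch's nn.Transformer are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class CandidateAwareScorer(nn.Module):
    """Sketch of a CALM score: encode the ASR hypothesis H(x; theta) as context,
    then decode the sequence W to obtain sum_t log P(w_t | w_<t, H(x; theta); Lambda)."""

    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def log_prob(self, hypothesis: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # hypothesis: (batch, S) word ids of the first-pass ASR output (the context)
        # target:     (batch, T) word ids of the sequence to score, starting with a BOS id
        T = target.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=target.device), diagonal=1)
        h = self.transformer(self.embed(hypothesis), self.embed(target), tgt_mask=causal)
        logp = self.out(h).log_softmax(dim=-1)
        # Position t predicts target[t + 1]; gather and sum the per-token log probabilities.
        token_lp = logp[:, :-1, :].gather(-1, target[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)  # one rescoring score per candidate sentence
```

In an N-best rescoring setting, a score of this kind could be computed for every candidate sentence and, for example, combined with the first-pass ASR score before reranking.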
Task and data
We evaluated the NCALMs on ASR tasks of Japanese lecture speech from CSJ and English conversational speech from the 300-hour SWBD corpus. The datasets are shown in Table 1. The evaluation sets for CSJ were the three standard sets (eval1, eval2, eval3); eval1 and eval2 are composed of actual academic presentations, and eval3 is composed of simulated presentations. The evaluation sets for SWBD were the Switchboard (swbd) and CallHome (callhome) portions of the NIST Hub5 2000 evaluation set.
In our experiments of
Conclusions
In this paper, we proposed NCALMs, which directly utilize ASR outputs as contexts to estimate the generative probabilities of words. We defined and formulated NCALMs as conditional generative models with neural networks. We investigated which hypotheses used for constructing the context in the encoder work best for our proposed models. Single-hypothesis based modeling and confusion network based modeling were compared. In the experiments, the best WER in single-hypothesis based modeling was achieved
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (38)
- Chen et al. (2000). Discriminative training on language model. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Goodman (2001). A bit of progress in language modeling. Comput. Speech Lang.
- Schwenk (2007). Continuous space language models. Comput. Speech Lang.
- Al-Rfou et al. (2019). Character-level language modeling with deeper self-attention. Proceedings of the Conference on Artificial Intelligence (AAAI).
- Ba et al. (2016). Layer normalization. CoRR.
- Beck et al. (2019). LSTM language models for LVCSR in first-pass decoding and lattice-rescoring. CoRR.
- Bengio et al. (2003). A neural probabilistic language model. J. Mach. Learn. Res.
- Chen et al. (2015). Recurrent neural network language model adaptation for multi-genre broadcast speech recognition. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Dai et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. Proceedings of the Conference of the Association for Computational Linguistics (ACL).
- Furui et al. (2000). A Japanese national project on spontaneous speech corpus and processing technology. Proceedings of ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium.
- Gillick and Cox (1989). Some statistical issues in the comparison of speech recognition algorithms. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Godfrey et al. (1992). Switchboard: telephone speech corpus for research and development. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Guo et al. (2019). A spelling correction model for end-to-end speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- He et al. (2016). Deep residual learning for image recognition. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).
- Hori et al. (2016). Minimum word error training of long short-term memory recurrent neural network language models for speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Huang et al. (2014). Cache based recurrent neural network language model inference for first pass speech recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Irie et al. (2015). Bag-of-words input for long history representation in neural network-based language models for speech recognition. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Irie et al. (2019). Language modeling with deep transformers. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH).
- Kingma and Ba (2015). Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR).