A unified system for multilingual speech recognition and language identification
Introduction
Multilingual speech recognition research has drawn increasing attention as international communication becomes more frequent. Studies have shown that multilingual speakers outnumber monolingual speakers in many areas (Waggoner, 1993; Baker, 2011), so multilingual speech recognition systems are in high demand. Typical multilingual automatic speech recognition (ASR) systems rely on parallel monolingual ASR systems to handle multilingual speech. It is therefore crucial to judge the language identity correctly so that the multilingual speech can be routed to the corresponding monolingual ASR back-end.
Existing multilingual ASR systems can be roughly grouped into three categories. The first category is cascade multilingual ASR systems with a language identification (LID) front-end and multiple monolingual ASR system back-ends. The second category is parallel multilingual ASR systems, in which the language information and speech content are recognized simultaneously. The third category is end-to-end multilingual ASR systems, which unify ASR and LID in the same process.
A cascade multilingual ASR system depends largely on the performance of its LID front-end (Lyu and Lyu, 2008; Barroso et al., 2010; Mabokela and Manamela, 2013). In Lin et al. (2012), to satisfy real-time LID requests and enhance the performance of the LID module, a margin was added to each LID score to compensate for possibly biased language decisions. Even when existing LID technologies improve the LID accuracy of cascade systems, the tandem mechanism inevitably introduces delays, which makes such systems difficult to apply in real-time ASR.
To reduce the response time of multilingual ASR systems, the preference is to conduct the LID and ASR processes in parallel. In Gonzalez-Dominguez et al. (2015), several language-detection strategies were investigated to reduce the overall system latency. Since LID and ASR can be performed simultaneously, parallel processing also enables intrasentence and intersentence multilingual ASR, where two or more languages occur within the same speech stream. Wu et al. (2006) proposed an approach to segment and identify intrasentence multilingual speech: to detect language switch points, a dynamic programming method was used to determine language boundaries globally. Although that study was implemented in an offline manner, it provides an impetus for online multilingual ASR.
Thanks to the successful development of the end-to-end framework in the field of speech recognition, multilingual ASR can also be conducted with an end-to-end framework. In Watanabe et al. (2017), the end-to-end architecture was first applied to language-independent multilingual speech recognition. Both intersentence (Seki et al., 2018) and intrasentence (Zeng et al., 2019; Luo et al., 2018) multilingual speech recognition have also been investigated under the end-to-end framework, which processes ASR and LID simultaneously. Although the end-to-end system unifies the ASR and LID processes, multilingual joint modeling under this framework also causes new problems. Due to differences in pronunciation mechanisms and grammar rules, joint modeling of multiple languages may inevitably result in confusion among languages (Kannan et al., 2019). The size of the modeling unit varies among languages, which causes an imbalance among the modeling units (Irie et al., 2019). In addition, for languages such as Chinese and Japanese, the modeling unit may be excessively large (Li et al., 2019). Across languages, the same words usually have different pronunciations. Since the mapping between the character-based modeling units and the acoustic features is modeled end-to-end, the acoustic features corresponding to the same character-based modeling unit differ among languages.
In hidden Markov model (HMM)-based ASR systems, the modeling units are balanced among languages, and the pronunciation rules and grammatical rules are modeled separately. Compared with the end-to-end multilingual framework, the HMM-based multilingual framework is therefore more stable. In addition, in contrast to cascade multilingual ASR systems, parallel multilingual ASR systems process ASR and LID simultaneously and can take advantage of their complementarity. Based on the above analysis, this paper focuses on the construction of an HMM-based parallel multilingual ASR system.
In this paper, the multilingual ASR is based on a multitask framework. The shared hidden layers are trained jointly on multiple languages, while the language-specific layers are each trained on a single language. The multitask-based multilingual acoustic model can ignore differences among language families and model multiple languages in a uniform framework while letting them supplement one another in the acoustic modeling procedure (Veselý et al., 2012). However, the training of the shared-hidden-layer network lacks language discrimination, so language adaptive training is necessary. In Tong et al. (2017), the authors investigated several language adaptive training methods adapted from speaker adaptive training and achieved a notable improvement. In this paper, language identity information is likewise applied to conduct language adaptive training of the multilingual acoustic model.
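This multitask topology, with a shared trunk trained on all languages and one output layer per language, can be sketched in plain Python. The layer sizes, weights, and class names below are purely illustrative toys, not the paper's actual network configuration:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def linear(w, b, v):
    # w is a list of weight rows; returns w @ v + b
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(w, b)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

class MultitaskAcousticModel:
    """Shared hidden layers used by every language, plus one
    language-specific output layer (head) per language."""

    def __init__(self, shared_layers, heads):
        self.shared_layers = shared_layers  # list of (w, b) pairs
        self.heads = heads                  # {language: (w, b)}

    def forward(self, frame, lang):
        h = frame
        for w, b in self.shared_layers:     # language-independent trunk
            h = relu(linear(w, b, h))
        w, b = self.heads[lang]             # language-specific head
        return softmax(linear(w, b, h))     # state posteriors for `lang`
```

During training, every language's gradients would update the shared trunk, while only the matching language's head is updated; this is what lets the languages supplement one another while keeping their output units separate.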
LID, which is the foundation of correctly generating the speech content, is the other core module of the multilingual ASR system. In general, LID methods can be classified into three categories. The first and most popular is the i-vector-based LID method, which takes advantage of a language information feature called the i-vector (Dehak et al., 2011). The second is the ASR-based LID method, which leverages a series of parallel large vocabulary continuous speech recognition (LVCSR) systems and generates the LID result from an ASR confidence score. The last is the deep neural network (DNN)-based LID method, which directly models language classification with DNNs. Because LID is a long-term classification task, the i-vector-based method can provide satisfactory performance in most instances, especially when the utterance duration is sufficiently long (Martinez et al., 2011; Dehak et al., 2011). ASR-based LID methods, which can be specified as parallel phone recognition followed by language model (PPRLM) (Yan and Barnard, 1995; Zissman, 1996) and parallel word recognition followed by language model (PWRLM) (Zissman and Berkling, 2001), take advantage of language-related pronunciation characteristics and can effectively distinguish different languages. DNN-based methods are limited by the length of their context history and can only make decisions on short-term acoustic features. This short-term characteristic results in worse performance than the i-vector-based methods when the utterance duration is long (Lopez-Moreno et al., 2014; Gonzalez-Dominguez et al., 2014; Tang et al., 2018).
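One common way to extend such a short-term DNN classifier to a long utterance is to fuse its per-chunk posteriors into a single utterance-level decision, e.g., by summing log-posteriors (a naive Bayes combination). The function name and fusion rule below are illustrative assumptions, not necessarily the paper's exact scoring:

```python
import math

def utterance_lid(chunk_posteriors, languages):
    """Fuse per-chunk DNN language posteriors into one utterance-level
    decision by summing log-posteriors over all chunks."""
    scores = {lang: 0.0 for lang in languages}
    for post in chunk_posteriors:
        for lang in languages:
            # Floor the posterior to avoid log(0) on overconfident chunks.
            scores[lang] += math.log(max(post[lang], 1e-10))
    best = max(scores, key=scores.get)
    return best, scores
```

Summing log-posteriors lets the classifier accumulate evidence chunk by chunk, so a longer utterance yields a more reliable decision even though each individual decision window is short.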
This paper addresses multilingual speech recognition in two scenarios. The first is the case in which the speech stream contains only one language and the language identity must be judged at the utterance level. The second is the case in which multiple languages are contained in the same speech stream and the time point of language switching must be detected dynamically. The DNN-based LID method is adopted to accomplish the utterance-level language judgment, and the ASR confidence score is supplied as a balancing metric. This paper represents the first time that the Viterbi beam search algorithm (Viterbi, 1967; Forney, 1973) is applied to accomplish dynamic language identification with a chunk-level DNN-based LID classifier. Additionally, it is the first time the ASR module and LID module are constructed in a unified architecture such that they can complement each other: the LID module contributes to the language adaptive training of the multilingual acoustic model, and the ASR decoding information acts as a confidence metric to balance the LID results. This paper is organized as follows. Section 2 describes the proposed collaborative framework for multilingual ASR and LID. Then, two types of LID applications for multilingual ASR are introduced. The details of the experimental configuration and the multilingual ASR and LID performance are given in Section 3. Finally, a summary of this paper is presented in Section 4.
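The dynamic, intrasentence scenario can be illustrated with a plain Viterbi search over chunk-level LID posteriors. This is a sketch only: it performs a full Viterbi search with no beam pruning, and the language-switch penalty is an assumed hyperparameter, not a value from the paper:

```python
import math

def viterbi_language_track(chunk_posteriors, languages, switch_penalty=2.0):
    """Find the most likely per-chunk language sequence.

    Each chunk contributes log P(lang | chunk); changing language between
    consecutive chunks costs `switch_penalty`, so isolated spurious flips
    are smoothed out while genuine switch points survive."""
    # delta[lang]: best log-score of any path ending in `lang` so far.
    delta = {l: math.log(max(chunk_posteriors[0][l], 1e-10))
             for l in languages}
    back = []  # back[t][lang]: best predecessor of `lang` at chunk t+1
    for post in chunk_posteriors[1:]:
        new_delta, ptr = {}, {}
        for l in languages:
            best_prev = max(
                languages,
                key=lambda p: delta[p] - (0.0 if p == l else switch_penalty))
            new_delta[l] = (delta[best_prev]
                            - (0.0 if best_prev == l else switch_penalty)
                            + math.log(max(post[l], 1e-10)))
            ptr[l] = best_prev
        delta, back = new_delta, back + [ptr]
    # Backtrace from the best final language.
    path = [max(delta, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

A change of label along the returned path marks a detected language switch point; with a beam, the inner maximization would only consider the surviving hypotheses instead of all languages.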
Section snippets
Multitask collaborative framework for multilingual ASR and LID
This section introduces the architecture of the proposed unified system for multilingual speech recognition and language identification. Fig. 1 shows an overview of the collaborative framework of ASR and LID. The proposed collaborative framework combines the ASR task and LID task and allows them to complement each other. The LID task provides a language information feature vector (i.e., an x-vector) to perform the language adaptive training of the multilingual acoustic model. The multilingual
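As a minimal illustration of this language adaptive conditioning, the utterance-level x-vector can be appended to each frame's acoustic features before they enter the shared layers. This is one common fusion point, assumed here for concreteness; the paper's exact fusion may differ:

```python
def language_adapted_input(frames, x_vector):
    """Append the utterance-level x-vector (the LID module's language
    information feature) to every acoustic frame, so the multilingual
    acoustic model sees language identity alongside the acoustics."""
    return [list(frame) + list(x_vector) for frame in frames]
```

Because the same x-vector is repeated on every frame of the utterance, the shared hidden layers receive a constant language cue while the frame-level acoustics vary normally.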
Experimental configuration and results
The experiments are conducted on three Babel databases: Cantonese (Can), Turkish (Tur), and Vietnamese (Vie). The Babel databases are collected as part of the IARPA (Intelligence Advanced Research Projects Activity) Babel program. The audio of the corpus is divided into segments according to the time points in the transcription. In this paper,
Conclusion
This paper proposed a unified framework for multilingual ASR and LID to solve the problem of multilingual speech recognition in two scenarios. One is the multilingual speech recognition problem in which the speech stream contains only one language and the language identity needs to be judged at the utterance level, and the other is the multilingual speech recognition problem in which multiple languages are contained in the same speech stream and the time point of language switching needs to be detected dynamically.
CRediT authorship contribution statement
Danyang Liu: Investigation, Data curation, Writing - original draft. Ji Xu: Methodology, Writing - review & editing. Pengyuan Zhang: Funding acquisition. Yonghong Yan: Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Key Research and Development Program, China (No. 2019QY1805), the National Natural Science Foundation of China (Nos. 61901466, 11590774, 11590770), and the National Key Research and Development Program, China (Nos. 2016YFB0801203, 2016YFB0801200).
References (44)
- et al. (2002). Weighted finite-state transducers in speech recognition. Comput. Speech Lang.
- Zissman and Berkling (2001). Automatic language identification. Speech Commun.
- Baker (2011). Foundations of Bilingual Education and Bilingualism, Vol. 79.
- Barroso et al. (2010). Language identification oriented to multilingual speech recognition in the Basque context.
- et al. (2016). Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition. IEICE Trans. Inf. Syst.
- Wu et al. (2006). Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. IEEE Trans. Audio Speech Lang. Process.
- Dehak et al. (2011). Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process.
- Dehak, N., Torres-Carrasquillo, P.A., Reynolds, D., Dehak, R. (2011). Language recognition via i-vectors and...
- Forney (1973). The Viterbi algorithm. Proc. IEEE.
- Gonzalez-Dominguez et al. (2015). A real-time end-to-end multilingual speech recognition architecture. IEEE J. Sel. Top. Sign. Proces.