Speech Communication

Volume 127, March 2021, Pages 17-28

A unified system for multilingual speech recognition and language identification

https://doi.org/10.1016/j.specom.2020.12.008

Highlights

  • This paper presents a multilingual speech recognition system.

  • The ASR module and LID module are constructed in a unified architecture to complement each other.

  • The LID module contributes to the language adaptive training of the acoustic model.

  • The ASR decoding information acts as the confidence metric to balance the LID results.

  • The Viterbi beam search algorithm is applied to dynamic language identification.

Abstract

In this paper, a multilingual automatic speech recognition (ASR) and language identification (LID) system is designed. In contrast to conventional multilingual ASR systems, this paper takes advantage of the complementarity of the ASR and LID modules. First, the LID module contributes to the language adaptive training of the multilingual acoustic model. Then, the ASR decoding information acts as a confidence metric to balance the LID results. To simulate complex multilingual speech recognition situations, two types of LID strategies are designed. For a multilingual speech recognition task in which only one language is contained in the speech stream, the language identity can be determined directly by an utterance-level judgment. Under this condition, a segment-level statistical component and a two-stage update strategy are designed to assist the utterance-level language classification. For a multilingual speech recognition task in which the speech stream contains multiple languages simultaneously, a Viterbi language state retrieval method based on neural network (NN) classification is used to detect the language state dynamically. In both cases, the ASR decoding information is used to adjust the language classification results. Without prior knowledge of the language identity, the enhanced LID module achieves an accuracy of 99.3% for utterance-level language judgment and 92.4% for dynamic language detection, and the multilingual ASR system provides performance comparable to that of monolingual ASR systems.

Introduction

Multilingual speech recognition research has drawn increasing attention as international communication becomes more frequent. Some studies have shown that multilingual speakers outnumber monolingual speakers in many areas (Waggoner, 1993, Baker, 2011), so multilingual speech recognition systems are in high demand. Typical multilingual automatic speech recognition (ASR) systems rely on parallel monolingual ASR systems to handle multilingual speech. It is therefore crucial to judge the language identity correctly so that the multilingual speech can be routed to the corresponding monolingual ASR back-end.

Existing multilingual ASR systems can be roughly grouped into three categories. The first category is cascade multilingual ASR systems with a language identification (LID) front-end and multiple monolingual ASR system back-ends. The second category is parallel multilingual ASR systems, in which the language information and speech content are recognized simultaneously. The third category is end-to-end multilingual ASR systems, which unify ASR and LID in the same process.

A cascade multilingual ASR system depends largely on the performance of its LID front-end (Lyu and Lyu, 2008, Barroso et al., 2010, Mabokela and Manamela, 2013). In Lin et al. (2012), to meet real-time LID requirements and enhance the performance of the LID module, a margin was added to each LID score to compensate for possibly biased language decisions. Although existing LID technologies improve the LID accuracy of cascade systems, the tandem mechanism inevitably introduces delays, which make such systems difficult to apply in real-time ASR.

To reduce the response time of multilingual ASR systems, it is preferable to conduct the LID and ASR processes in parallel. In Gonzalez-Dominguez et al. (2015), several language-detection strategies were investigated to reduce the overall system latency. Since LID and ASR can be performed simultaneously, parallel processing also enables intrasentence and intersentence multilingual ASR, where two or more languages occur within the same speech stream. In Wu et al. (2006), the authors proposed an approach to segment and identify intrasentence multilingual speech; to detect language switch points, a dynamic programming method was used to determine language boundaries globally. Although that study was implemented in an offline manner, it provides an impetus for online multilingual ASR.

Thanks to the successful development of the end-to-end framework in the field of speech recognition, multilingual ASR can also be conducted in an end-to-end manner. In Watanabe et al. (2017), the end-to-end architecture was first applied to language-independent multilingual speech recognition. Intersentence (Seki et al., 2018) and intrasentence (Zeng et al., 2019, Luo et al., 2018) multilingual speech recognition have also been investigated under the end-to-end framework, processing ASR and LID simultaneously. Although the end-to-end system unifies the ASR and LID processes, joint multilingual modeling under this framework also raises new problems. Owing to differences in pronunciation mechanisms and grammar rules, joint modeling of multiple languages may inevitably result in confusion among languages (Kannan et al., 2019). The size of the modeling unit varies among languages, which causes an imbalance among the modeling units (Irie et al., 2019); in addition, for languages such as Chinese and Japanese, the modeling unit inventory may be excessively large (Li et al., 2019). Moreover, the same words usually have different pronunciations in different languages. Since the mapping between character-based modeling units and acoustic features is modeled end-to-end, the acoustic features corresponding to the same character-based modeling unit differ among languages.

In hidden Markov model (HMM)-based ASR systems, the modeling units are balanced among languages, and the pronunciation rules and grammatical rules are modeled separately. Compared to the end-to-end multilingual framework, the HMM-based multilingual framework is therefore more stable. In addition, in contrast to cascade multilingual ASR systems, parallel multilingual ASR systems process ASR and LID simultaneously and can take advantage of their complementarity. Based on the above analysis, this paper focuses on the construction of an HMM-based parallel multilingual ASR system.

In this paper, the multilingual ASR is based on the multitask framework. The shared hidden layers are trained jointly on data from multiple languages, while the language-specific layers are each trained on a single language. The multitask-based multilingual acoustic model can ignore differences in language families and model multiple languages in a uniform framework while allowing them to complement one another in the acoustic modeling procedure (Veselý et al., 2012). However, the training of the shared-hidden-layer network lacks language discrimination, so language adaptive training is necessary. In Tong et al. (2017), the authors investigated several language adaptive training methods adapted from speaker adaptive training and achieved a notable improvement. In this paper, language identity information is also applied to conduct language adaptive training of the multilingual acoustic model; a minimal sketch of the multitask topology is given below.
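To make the shared/language-specific split concrete, the following is a minimal PyTorch sketch of a multitask acoustic model. The class name, layer sizes, depth, and senone counts are illustrative assumptions, not the authors' exact topology.

```python
import torch
import torch.nn as nn

class MultitaskAcousticModel(nn.Module):
    """Illustrative multitask acoustic model: shared hidden layers plus
    one language-specific output branch per language (hypothetical sizes)."""

    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        # Shared hidden layers: updated by data from all languages.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Language-specific layers: each branch sees only its own language.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(hidden_dim, n_out)
            for lang, n_out in senones_per_lang.items()
        })

    def forward(self, feats, lang):
        # feats: (batch, feat_dim) acoustic features; lang selects the branch.
        return self.heads[lang](self.shared(feats))

# Example: three languages with different (hypothetical) senone inventories.
model = MultitaskAcousticModel(40, 512, {"Can": 3000, "Tur": 2800, "Vie": 2900})
logits = model(torch.randn(8, 40), "Can")
```

Under this scheme, each mini-batch updates the shared layers together with the output branch of its own language only, which is what gives the shared layers their cross-lingual character.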

LID, which is the foundation for correctly recognizing the speech content, is the other core module of the multilingual ASR system. In general, LID methods can be classified into three categories. The first and most popular is the i-vector-based LID method, which takes advantage of a language information feature called the i-vector (Dehak et al., 2011). The second is the ASR-based LID method, which leverages a series of parallel large vocabulary continuous speech recognition (LVCSR) systems and generates the LID result from an ASR confidence score. The last is the deep neural network (DNN)-based LID method, which directly models language classification with DNNs. Because LID is a long-term classification task, the i-vector-based method can provide satisfactory performance in most instances, especially when the utterance duration is sufficiently long (Martinez et al., 2011, Dehak et al., 2011). ASR-based LID methods, instantiated as parallel phone recognition followed by language model (PPRLM) (Yan and Barnard, 1995, Zissman, 1996) and parallel word recognition followed by language model (PWRLM) (Zissman and Berkling, 2001), take advantage of language-related pronunciation characteristics and can effectively distinguish different languages. NN-based methods are limited by the length of the context history and can only make decisions based on short-term acoustic features; this short-term characteristic results in worse performance than that of i-vector-based methods when the utterance duration is long (Lopez-Moreno et al., 2014, Gonzalez-Dominguez et al., 2014, Tang et al., 2018).

This paper addresses multilingual speech recognition in two scenarios. The first is the case in which the speech stream contains only one language and the language identity needs to be judged at the utterance level. The second is the case in which multiple languages are contained in the same speech stream and the time point of language switching must be detected dynamically. The NN-based LID method is adopted to accomplish the utterance-level language judgment, with the ASR confidence score supplied as a balance metric. This paper represents the first time that the Viterbi beam search algorithm (Viterbi, 1967, Forney, 1973) is applied to dynamic language identification with a chunk-level NN-based LID classifier (an illustrative sketch is given at the end of this section). It is also the first time that the ASR module and LID module are constructed in a unified architecture such that they can complement each other: the LID module contributes to the language adaptive training of the multilingual acoustic model, and the ASR decoding information acts as a confidence metric to balance the LID results.

This paper is organized as follows. Section 2 describes the proposed collaborative framework for multilingual ASR and LID; two types of LID applications for multilingual ASR are then introduced. The details of the experimental configuration and the multilingual ASR and LID performance are given in Section 3. Finally, a summary is presented in Section 4.
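As an illustration of the dynamic language detection idea, below is a minimal NumPy sketch of Viterbi decoding over language states, where the emission scores are chunk-level log posteriors from an NN LID classifier and a fixed penalty discourages language switches. The function name, the penalty value, and the exhaustive (un-pruned) search are assumptions made for clarity; the paper additionally applies beam pruning and balances the scores with ASR decoding information.

```python
import numpy as np

def viterbi_language_states(log_post, switch_penalty=-3.0):
    """Most likely language label per chunk.

    log_post: (num_chunks, num_langs) log posteriors from a chunk-level
    NN LID classifier. switch_penalty: log-domain cost paid whenever the
    language state changes between consecutive chunks.
    """
    num_chunks, num_langs = log_post.shape
    # delta[t, l]: best cumulative score ending in language l at chunk t.
    delta = np.full((num_chunks, num_langs), -np.inf)
    backptr = np.zeros((num_chunks, num_langs), dtype=int)
    delta[0] = log_post[0]
    for t in range(1, num_chunks):
        for l in range(num_langs):
            # Staying in the same language is free; switching pays a penalty.
            trans = np.where(np.arange(num_langs) == l, 0.0, switch_penalty)
            scores = delta[t - 1] + trans
            backptr[t, l] = int(np.argmax(scores))
            delta[t, l] = scores[backptr[t, l]] + log_post[t, l]
    # Backtrace from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(num_chunks - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return path[::-1]

# Example: 2 languages, 5 chunks; a mild penalty lets one switch through.
lp = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6],
                      [0.2, 0.8], [0.3, 0.7]]))
print(viterbi_language_states(lp, switch_penalty=-1.0))  # -> [0, 0, 1, 1, 1]
```

The switch penalty plays the role of a language transition probability: the larger its magnitude, the more evidence the chunk-level classifier must accumulate before the decoded language state is allowed to change.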

Section snippets

Multitask collaborative framework for multilingual ASR and LID

This section introduces the architecture of the proposed unified system for multilingual speech recognition and language identification. Fig. 1 shows an overview of the collaborative framework of ASR and LID. The proposed collaborative framework combines the ASR task and LID task and allows them to complement each other. The LID task provides a language information feature vector (i.e., an x-vector) to perform the language adaptive training of the multilingual acoustic model. The multilingual
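The snippet above describes conditioning the acoustic model on an utterance-level x-vector from the LID task. Below is a minimal sketch of one common way to inject such a vector, assuming simple frame-wise feature concatenation; the excerpt does not show the paper's exact integration point, so the dimensions and the concatenation scheme are assumptions.

```python
import torch

# feats: (T, feat_dim) frame-level acoustic features for one utterance;
# xvec: (xvec_dim,) language information vector from the LID module.
feats = torch.randn(200, 40)
xvec = torch.randn(100)

# Append the same x-vector to every frame before the shared layers,
# giving the network an explicit language cue during adaptive training.
adapted = torch.cat([feats, xvec.expand(feats.size(0), -1)], dim=1)
assert adapted.shape == (200, 140)
```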

Experimental configuration and results

The experiments are conducted on three Babel databases: Cantonese (Can), Turkish (Tur), and Vietnamese (Vie). The Babel databases were collected as part of the IARPA (Intelligence Advanced Research Projects Activity) Babel program. The audio of each corpus is divided into segments according to the time points in the transcriptions. In this paper,

Conclusion

This paper proposed a unified framework for multilingual ASR and LID to solve the problem of multilingual speech recognition in two scenarios. One is the multilingual speech recognition problem in which the speech stream contains only one language and the language identity needs to be judged at the utterance level, and the other is the multilingual speech recognition problem in which multiple languages are contained in the same speech stream and the time point of language switching needs to be detected dynamically.

CRediT authorship contribution statement

Danyang Liu: Investigation, Data curation, Writing - original draft. Ji Xu: Methodology, Writing - review & editing. Pengyuan Zhang: Funding acquisition. Yonghong Yan: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Key Research and Development Program, China (No. 2019QY1805), the National Natural Science Foundation of China (Nos. 61901466, 11590774, 11590770), and the National Key Research and Development Program, China (Nos. 2016YFB0801203, 2016YFB0801200).

References

  • Mohri, M., et al. Weighted finite-state transducers in speech recognition. Comput. Speech Lang. (2002)

  • Zissman, M.A., et al. Automatic language identification. Speech Commun. (2001)

  • Baker, C. Foundations of Bilingual Education and Bilingualism, Vol. 79 (2011)

  • Barroso, N., et al. Language identification oriented to multilingual speech recognition in the Basque context

  • Chen, M., et al. Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition. IEICE Trans. Inf. Syst. (2016)

  • Wu, C.-H., et al. Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. IEEE Trans. Audio Speech Lang. Process. (2006)

  • Dehak, N., et al. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. (2011)

  • Dehak, N., Torres-Carrasquillo, P.A., Reynolds, D., Dehak, R., 2011. Language recognition via i-vectors and...

  • Forney, G. The Viterbi algorithm. Proc. IEEE (1973)

  • Gonzalez-Dominguez, J., et al. A real-time end-to-end multilingual speech recognition architecture. IEEE J. Sel. Top. Sign. Proces. (2015)

  • Gonzalez-Dominguez, J., Lopez-Moreno, I., Sak, H., Gonzalez-Rodriguez, J., Moreno, P.J., 2014. Automatic language...

  • Hampshire, J., et al. A novel objective function for improved phoneme recognition using time delay neural networks

  • Hieronymus, J., et al. Robust spoken language identification using large vocabulary speech recognition

  • Irie, K., et al. On the choice of modeling unit for sequence-to-sequence speech recognition

  • Kannan, A., et al. Large-scale multilingual speech recognition with a streaming end-to-end model

  • Kim, S., et al. Towards language-universal end-to-end speech recognition

  • Lamel, L., et al. Language identification using phone-based acoustic likelihoods

  • Li, B., Zhang, Y., Sainath, T., Wu, Y., Chan, W., 2019. Bytes are all you need: End-to-end multilingual speech...

  • Lin, H., et al. Recognition of multilingual speech in mobile applications

  • Liu, D., et al. Multilingual speech recognition training and adaptation with language-specific gate units

  • Lopez-Moreno, I., et al. Automatic language identification using deep neural networks

  • Luo, N., et al. Towards end-to-end code-switching speech recognition (2018)