
Variance Normalised Features for Language and Dialect Discrimination

  • Short Paper
  • Published in Circuits, Systems, and Signal Processing

Abstract

This paper proposes novel features for automated language and dialect identification that aim to improve discriminative power by ensuring that each element of the feature vector makes a normalised contribution to inter-class variance. The method first computes inter- and intra-class frequency variance statistics, then partitions the spectrum into regions sized to contain near-equal shares of the overall variance. Spectral features are average-pooled within each region to obtain variance normalised features (VNFs). The proposed VNFs are low-complexity, drop-in replacements for MFCC, SDC, PLP or other input features used in speech-related tasks. In this paper, they are evaluated against MFCCs in three types of system on two data-constrained language and dialect identification tasks. VNFs demonstrate good results, comfortably outperforming MFCCs at most feature dimensions, and yielding particularly strong performance for the most challenging data-constrained 3 s utterance length in the LID task.
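The equal-variance partitioning and pooling steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, only inter-class variance is used (the paper also computes intra-class statistics), and region boundaries are placed by a simple cumulative-variance quantile rule.

```python
import numpy as np

def variance_normalised_features(spectra, labels, n_regions=13):
    """Sketch of VNF extraction (hypothetical helper, not the paper's code):
    partition frequency bins into regions holding near-equal shares of the
    inter-class variance, then average-pool each frame within every region.

    spectra : (n_frames, n_bins) spectral feature frames from all classes
    labels  : (n_frames,) class label of each frame
    """
    classes = np.unique(labels)
    # Per-bin class means and the global mean across all frames
    class_means = np.stack([spectra[labels == c].mean(axis=0) for c in classes])
    global_mean = spectra.mean(axis=0)
    # Inter-class variance of each frequency bin
    inter_var = ((class_means - global_mean) ** 2).mean(axis=0)
    # Region boundaries: each region gets a near-equal share of total variance
    cum = np.cumsum(inter_var) / inter_var.sum()
    targets = np.linspace(0.0, 1.0, n_regions + 1)[1:-1]
    edges = np.searchsorted(cum, targets, side="right")
    regions = np.split(np.arange(spectra.shape[1]), edges)
    # Average-pool every frame's spectrum within each region
    # (degenerate empty regions are not handled in this sketch)
    return np.stack([spectra[:, r].mean(axis=1) for r in regions], axis=1)
```

Bins carrying more inter-class variance end up in narrower regions, so each output dimension contributes roughly equally to class discrimination, which is the normalisation the abstract describes.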



Data Availability Statement

The LRE07 dataset analysed during the first part of the current study is licensed and is publicly available from the Linguistic Data Consortium, https://catalog.ldc.upenn.edu/LDC2009S04.

The ADI17 dataset analysed during the second part of the study is available for download for research purposes under a Creative Commons Attribution-ShareAlike 4.0 International License from http://groups.csail.mit.edu/sls/downloads/adi17. Copyright remains with the original owners of the recordings.


Author information

Correspondence to Ian McLoughlin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by the National Key Research and Development Program (Nos. 2016YFB0801203, 2016YFB0801200), the National Natural Science Foundation of China (Nos. 11590774, 11590770), the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (No. 2016A03007-1) and the China Scholarship Council, which funded the first author to conduct the research at the University of Kent in the UK.

Appendices

The main acronyms used in the paper are listed below:

ADI: Arabic Dialect Identification
AED: Audio Event Detection
ASR: Automatic Speech Recognition
CLSTM: Convolutional and Long Short-Term Memory-Recurrent
DBF: Deep Bottleneck Feature
DBN: Deep Bottleneck Network
DNN: Deep Neural Network
DID: Dialect Identification
EER: Equal Error Rate
GMM-UBM: Gaussian Mixture Model-Universal Background Model
LDA: Linear Discriminant Analysis
LID: Language Identification
LSTM: Long Short-Term Memory
MFCC: Mel-Frequency Cepstral Coefficients
PR: Phoneme Recogniser
PRLM: Phoneme Recogniser followed by Language Model
PPRLM: Parallel Phone Recogniser followed by Language Model
PPR-SVM: Parallel Phoneme Recogniser followed by Support Vector Machines
PPR-VSM: Parallel Phoneme Recogniser followed by Vector Space Model
PLP: Perceptual Linear Prediction
SDC: Shifted Delta Cepstra
SDNN: Simple DNN x-vector
TDNN: Time-Delay Neural Network
VNF: Variance Normalised Feature


About this article


Cite this article

Miao, X., McLoughlin, I. & Song, Y. Variance Normalised Features for Language and Dialect Discrimination. Circuits Syst Signal Process 40, 3621–3638 (2021). https://doi.org/10.1007/s00034-020-01641-1

