
Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification


Abstract

Research and development of speech technology applications in low-resource languages (LRLs) is challenging due to the lack of suitable speech corpora. For most Indian languages in particular, the amount and variety of data available from digital sources are sparse, and prior work is too limited to meet large-scale development needs. This paper describes the creation of such an LRL corpus, comprising sixteen rarely studied Eastern and Northeastern (E&NE) Indian languages, and presents its data variability through different statistics. Furthermore, several experiments are carried out on the collected LRL corpus to build baseline speaker identification (SID) and language identification (LID) systems for acceptance evaluation. To investigate the presence of speaker- and language-specific information, spectral features such as Mel-frequency cepstral coefficients (MFCCs), shifted delta cepstral (SDC) coefficients, and relative spectral transform perceptual linear prediction (RASTA-PLP) features are used. Vector quantization (VQ), Gaussian mixture model (GMM), support vector machine (SVM), and multilayer perceptron (MLP)-based models are developed to represent the speaker- and language-specific information captured by the spectral features. In addition, i-vector, time-delay neural network (TDNN), and long short-term memory recurrent neural network (LSTM-RNN) based SID and LID models are evaluated in line with recent approaches. The performance of the developed systems is analyzed on the LRL corpus in terms of SID and LID accuracy. The best SID and LID accuracies are 94.49% and 95.69%, respectively, obtained by the LSTM-RNN baseline using MFCC + SDC features.
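
As a concrete illustration of the front end and back end described above, the following Python sketches show how MFCC + SDC features and an LSTM-RNN classifier of the kind named in the abstract are commonly assembled. These are minimal sketches under assumed settings (8 kHz audio, the widely used 7-1-3-7 SDC configuration, illustrative network sizes and function names), not the exact configuration behind the reported results.

    # Minimal MFCC + SDC front end (assumed 7-1-3-7 SDC configuration).
    import numpy as np
    import librosa

    def mfcc_sdc(wav_path, sr=8000, n_mfcc=7, d=1, p=3, k=7):
        """Frame-wise MFCC + shifted delta cepstral features, shape (T, n_mfcc*(k+1))."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
        # Delta cepstra: delta[t] = c[t + d] - c[t - d], with edges clamped
        pad = np.pad(mfcc, ((0, 0), (d, d)), mode="edge")
        delta = pad[:, 2 * d:] - pad[:, :-2 * d]
        # SDC: for each frame, stack k delta vectors spaced p frames apart
        T = delta.shape[1]
        blocks = []
        for i in range(k):
            shifted = np.roll(delta, -i * p, axis=1)
            if i > 0:
                shifted[:, T - i * p:] = 0.0                     # zero the wrapped tail
            blocks.append(shifted)
        sdc = np.vstack(blocks)                                  # (n_mfcc * k, T)
        return np.vstack([mfcc, sdc]).T

An LSTM-RNN back end can then classify these frame sequences at the utterance level; a hypothetical classifier with illustrative sizes (e.g., 16 output classes for the sixteen languages in the LID case) might look like:

    # Hypothetical LSTM-RNN back end for LID/SID over frame-level features.
    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, feat_dim=56, hidden=128, n_classes=16):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, x):                  # x: (batch, frames, feat_dim)
            h, _ = self.lstm(x)                # per-frame hidden states
            return self.out(h.mean(dim=1))     # mean-pool over time, then classify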



Data availability statement

The data that support the findings of this study are available on request from the corresponding author, J. Basu, after permission has been obtained from the funding agency of the project. The data are not publicly available because they were collected under a project funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for speaker and language identification applications in low-resource Eastern and Northeastern Indian languages.


Acknowledgements

This work is part of the project "Deployment of Automatic Speaker Recognition System on Conversational Speech Data for North-Eastern states," funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India. The authors are thankful to the funding agency for its support and cooperation. The authors would like to record their deep appreciation for the unstinted support and cooperation of the authorities and students of different linguistic groups of the North Eastern Regional Institute of Science and Technology (NERIST), Arunachal Pradesh, India, during the collection of data. The authors also acknowledge the contribution of the native speakers of the E&NE Indian states who participated in the data collection task. The authors are thankful to the Centre for Development of Advanced Computing (CDAC), Kolkata, India, for the necessary support to carry out the research activity.

Author information


Corresponding author

Correspondence to Joyanta Basu.


About this article


Cite this article

Basu, J., Khan, S., Roy, R. et al. Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification. Circuits Syst Signal Process 40, 4986–5013 (2021). https://doi.org/10.1007/s00034-021-01704-x

