Abstract
In recent years, extensive research has been carried out on possible implementations of speaker diarization systems. These systems require efficient clustering algorithms to perform well in real-time processing. Teaching–learning-based optimization (TLBO) is one such clustering algorithm; it can solve the clustering problem to the optimum in a reasonable time. In this paper, a speaker diarization (SD) system is implemented in real time on a Raspberry Pi 3 (RPi 3), using the TLBO technique as the classifier. The system was evaluated on a broadcast radio dataset (NDTV), and the experimental tests show that the technique achieves acceptable performance in terms of diarization error rate (DER = 21.90% and 35% in single- and cross-show diarization, respectively), accuracy (87.30%), and real-time factor (RTF = 2.40). We also tested the TLBO technique on a 2.4 GHz Intel Core i5 processor using the REPERE corpus. There, improved results were obtained in terms of execution time (xRT) and DER in both single- and cross-show speaker diarization (0.08 and 0.095, and 18.50% and 26.30%, respectively).
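To make the idea of TLBO as a clustering step more concrete, the following is a minimal sketch of the generic TLBO scheme (teacher phase followed by learner phase) applied to a centroid-based clustering cost. It is an illustration under stated assumptions, not the system described in the paper: the function names (`tlbo_cluster`, `cost`, `unflatten`, `sq_dist`), the sum-of-squared-distances objective, the population size, and the iteration count are all hypothetical choices, and the points stand in for whatever speaker features an actual system would use.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def unflatten(vec, dim):
    """Split a flat learner vector into centroids of length dim."""
    return [vec[i:i + dim] for i in range(0, len(vec), dim)]

def cost(vec, points, dim):
    """Sum of squared distances from each point to its nearest centroid
    (an assumed objective, chosen only for illustration)."""
    cents = unflatten(vec, dim)
    return sum(min(sq_dist(p, c) for c in cents) for p in points)

def tlbo_cluster(points, k, n_learners=20, n_iters=30, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Each learner encodes k centroids, initialised from random data points.
    pop = [[x for p in rng.sample(points, k) for x in p]
           for _ in range(n_learners)]
    costs = [cost(v, points, dim) for v in pop]

    def accept_if_better(i, cand):
        # Greedy selection: keep the candidate only if it lowers the cost.
        c = cost(cand, points, dim)
        if c < costs[i]:
            pop[i], costs[i] = cand, c

    for _ in range(n_iters):
        # Teacher phase: shift each learner by the gap between the teacher
        # (current best solution) and the population mean.
        teacher = list(pop[costs.index(min(costs))])
        mean = [sum(col) / n_learners for col in zip(*pop)]
        for i in range(n_learners):
            tf = rng.choice((1, 2))  # teaching factor, 1 or 2
            cand = [x + rng.random() * (t - tf * m)
                    for x, t, m in zip(pop[i], teacher, mean)]
            accept_if_better(i, cand)
        # Learner phase: each learner interacts with one random peer and
        # moves away from it if the peer is worse, toward it otherwise.
        for i in range(n_learners):
            j = rng.choice([m for m in range(n_learners) if m != i])
            sign = 1.0 if costs[i] < costs[j] else -1.0
            cand = [x + sign * rng.random() * (x - y)
                    for x, y in zip(pop[i], pop[j])]
            accept_if_better(i, cand)

    best = costs.index(min(costs))
    centroids = unflatten(pop[best], dim)
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]
    return centroids, labels, costs[best]
```

Calling `tlbo_cluster(points, k=2)` on a list of equal-length tuples returns the best centroids found, one cluster label per point, and the final cost. Because a candidate is only accepted when it lowers the cost, the best cost never increases across iterations; TLBO also needs no algorithm-specific tuning parameters beyond population size and iteration count.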
Cite this article
Dabbabi, K., Hajji, S. & Cherif, A. Real-Time Implementation of Speaker Diarization System on Raspberry PI3 Using TLBO Clustering Algorithm. Circuits Syst Signal Process 39, 4094–4109 (2020). https://doi.org/10.1007/s00034-020-01357-2
DOI: https://doi.org/10.1007/s00034-020-01357-2