Abstract
Automatic Speech Recognition (ASR) plays a vital role in a wide range of real-world applications. However, commercial ASR solutions are typically “one-size-fits-all” products, and clients inevitably face the risk of severe performance degradation in field tests. Meanwhile, with new data regulations such as the European Union’s General Data Protection Regulation (GDPR) coming into force, ASR vendors, which traditionally train on speech data in a centralized manner, are increasingly unable to solve this problem, since accessing clients’ speech data is prohibited. Here, we show that by integrating three machine learning paradigms, namely Transfer learning, Federated learning, and Evolutionary learning (TFE), we can build a win-win ecosystem for ASR clients and vendors that solves the aforementioned problems. Through large-scale quantitative experiments, we show that with TFE, clients can enjoy far better ASR solutions than the “one-size-fits-all” counterpart, and vendors can exploit the abundance of clients’ data to effectively refine their own ASR products.
Index Terms
- A GDPR-compliant Ecosystem for Speech Recognition with Transfer, Federated, and Evolutionary Learning