Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms
Expert Systems with Applications (IF 7.5), Pub Date: 2020-03-24, DOI: 10.1016/j.eswa.2020.113402
Jordan J. Bird, Elizabeth Wanner, Anikó Ekárt, Diego R. Faria

Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model (HMM) methods of classification for human speech recognition through diphthong vowel sounds in the English phonetic alphabet, rather than the classical approach of classifying whole words and phrases, with a specific focus on both single- and multi-objective evolutionary optimisation of bio-inspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as into a reshaped MFCC time-series format for forecast-based models. A deep neural network with an evolutionarily optimised topology achieves 90.77% phoneme classification accuracy, compared with 86.23% for the best HMM (150 hidden units), when only accuracy is considered in a single-objective optimisation approach. The obtained solutions are far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM, which motivates exploring a multi-objective approach. In the scalarisation-based multi-objective approaches presented, in which real-time resource usage also contributes to solution fitness, the solutions produced train far more quickly than the forecast approach (69 seconds) while retaining classification ability (86.73%). Weightings from 0.1 to 0.9 towards either maximising accuracy or reducing resource usage are suggested depending on the resources available, since many future IoT devices and autonomous robots may have limited access to cloud resources at a premium in comparison to the GPU used in this experiment.
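The scalarisation idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual fitness function, so everything below is an assumption made for illustration: the `Candidate` class, the `scalarised_fitness` and `best` helpers, the linear weighted sum, and the normalisation of training time by a hypothetical `max_seconds` budget. The accuracy and training-time figures are the ones quoted in the abstract, with the 69-second, 86.73% result read here as the multi-objective solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A trained model summarised by the two objectives (hypothetical type)."""
    name: str
    accuracy: float       # classification accuracy in [0, 1]
    train_seconds: float  # measured training time in seconds


def scalarised_fitness(c: Candidate, w_accuracy: float, max_seconds: float) -> float:
    """Collapse both objectives into one score; higher is better.

    Assumption: a linear weighted sum, with training time mapped to a
    saving in [0, 1] relative to max_seconds. The paper may use a
    different formulation.
    """
    if not 0.0 <= w_accuracy <= 1.0:
        raise ValueError("w_accuracy must lie in [0, 1]")
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    time_saving = 1.0 - min(c.train_seconds, max_seconds) / max_seconds
    return w_accuracy * c.accuracy + (1.0 - w_accuracy) * time_saving


def best(candidates, w_accuracy: float, max_seconds: float) -> Candidate:
    """Return the candidate with the highest scalarised fitness."""
    return max(candidates, key=lambda c: scalarised_fitness(c, w_accuracy, max_seconds))


# Figures quoted in the abstract; the names are labels chosen for this sketch.
CANDIDATES = [
    Candidate("single_objective_dnn", 0.9077, 248.0),
    Candidate("best_hmm", 0.8623, 160.0),
    Candidate("multi_objective", 0.8673, 69.0),
]

if __name__ == "__main__":
    for w in (0.1, 0.5, 0.9, 1.0):
        winner = best(CANDIDATES, w, max_seconds=248.0)
        print(f"w_accuracy={w}: {winner.name}")
```

Under this particular normalisation, a weight of 1.0 reduces to the single-objective case and selects on accuracy alone, while lower weights increasingly reward shorter training. Where the balance tips depends on the chosen `max_seconds`, which is why the weighting is best tuned to the hardware actually available.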




Updated: 2020-03-24