Psycho-acoustics inspired automatic speech recognition
Computers & Electrical Engineering (IF 4.3), Pub Date: 2021-06-04, DOI: 10.1016/j.compeleceng.2021.107238
Gianpaolo Coro, Fabio Valerio Massoli, Antonio Origlia, Francesco Cutugno

Understanding the human spoken language recognition process is still a distant scientific goal. Today, commercial automatic speech recognisers (ASRs) achieve high performance on clean speech, but their approaches bear little relation to human speech recognition. They commonly process the phonetic structure of speech while neglecting the supra-segmental and syllabic features that are integral to human speech recognition. As a result, these ASRs perform poorly on spontaneous speech and are very costly to develop, because their phonetic and pronunciation models must capture the large variability of human speech. This paper presents a novel ASR that addresses these issues and questions conventional ASR approaches. It uses alternative acoustic models and an exhaustive decoding algorithm to process speech at a syllabic temporal scale (100–250 ms) through a multi-temporal approach inspired by psycho-acoustic studies. A performance comparison on the recognition of spoken Italian numbers (from 0 to 1 million) shows that our approach is cost-effective, outperforms standard phonetic models, and reaches state-of-the-art performance.
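
The abstract does not describe how the multi-temporal analysis is implemented. As a rough illustration only, and not the authors' method, the Python sketch below shows one way a signal could be framed at several window lengths within the syllabic range of 100–250 ms. The function names (frame_signal, log_energy, multi_scale_energies), the 50 ms hop, the four window sizes, and the log-energy feature are all assumptions made for this example.

```python
import math


def frame_signal(samples, sample_rate, window_ms, hop_ms):
    """Split samples into fixed-length windows (hypothetical helper).

    A trailing window shorter than window_ms is dropped.
    """
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must span at least one sample")
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]


def log_energy(frame):
    """Log of the mean squared amplitude; the small constant avoids log(0)."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-12)


def multi_scale_energies(samples, sample_rate,
                         window_sizes_ms=(100, 150, 200, 250), hop_ms=50):
    """Compute one log-energy track per window length.

    The window sizes are assumed values inside the 100-250 ms range
    quoted in the abstract; the paper may use different ones.
    """
    return {
        w: [log_energy(f) for f in frame_signal(samples, sample_rate, w, hop_ms)]
        for w in window_sizes_ms
    }
```

Calling multi_scale_energies on one second of audio sampled at 16 kHz returns a dictionary with one list of log-energy values per window length. Longer windows produce fewer frames because each frame covers more of the signal.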




Updated: 2021-06-04