Monophone-based connected word Hindi speech recognition improvement
Sādhanā (IF 1.6), Pub Date: 2021-05-06, DOI: 10.1007/s12046-021-01614-3
SHOBHA BHATT , ANURAG JAIN , AMITA DEV

This paper proposes a model to improve monophone-based connected word speech recognition for the Hindi language using the Hidden Markov Model (HMM). The model combines hybrid subword units with domain-specific syntactic structures. The hybrid units contain both phoneme-based and syllable-based subword units. Because syllable-based subword units cover a larger acoustic span, contextual effects are reduced. In the hybrid model, syllable-based acoustic units are applied only to nasal sounds, in order to improve their recognition score. A further improvement comes from using syntactic structures in the grammar definition during recognition: domain-specific syntactic structures reduce the recognizer's search space and consequently improve system performance. Two grammar definitions were applied: gram1, with no restrictions, and gram2, with domain-specific structures. The speech recognition framework was implemented with the HMM-based toolkit HTK using five-state HMMs. A self-created connected word speech dataset with a vocabulary of 240 Hindi words was used. Mel frequency cepstral coefficients (MFCCs), MFCCs with energy (MFCC_E), and perceptual linear prediction coefficients with energy (PLP_E) were used for feature extraction. Monophones were trained with and without silence fixing to check the impact of short pauses on the recognizer's performance. The system was tested in both speaker-dependent and speaker-independent modes. The hybrid model combined with gram2 and silence fixing provided the best results. Using MFCCs, gram2, phoneme-based modelling, and silence fixing, the system obtained an overall word accuracy of 80.28%, word correct of 80.28%, and a word error rate of 19.72%. Using PLP_E coefficients, the hybrid model, silence fixing, and gram2, the system obtained an overall word accuracy of 88.54%, word correct of 88.54%, and a word error rate of 11.46%.
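
The three reported metrics are closely related. The sketch below assumes the figures follow the definitions commonly used by HTK's HResults tool (word correct counts only deletions and substitutions as errors, while word accuracy also penalizes insertions); the abstract does not state this explicitly. The function name and the error counts in the usage example are hypothetical, chosen only so the arithmetic reproduces the 80.28% / 19.72% figures. They are not counts reported by the authors.

```python
def htk_style_scores(n_ref, deletions, substitutions, insertions):
    """Return (word_correct, word_accuracy, wer) as percentages.

    Assumed HTK-style definitions:
      word correct  = (N - D - S) / N
      word accuracy = (N - D - S - I) / N
      WER           = (D + S + I) / N
    where N is the number of words in the reference transcriptions.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    hits = n_ref - deletions - substitutions
    word_correct = 100.0 * hits / n_ref
    word_accuracy = 100.0 * (hits - insertions) / n_ref
    wer = 100.0 * (deletions + substitutions + insertions) / n_ref
    return word_correct, word_accuracy, wer


# Hypothetical counts (not from the paper): 10,000 reference words,
# 1,972 deletion/substitution errors in total, and no insertions.
correct, accuracy, wer = htk_style_scores(
    n_ref=10_000, deletions=572, substitutions=1_400, insertions=0
)
print(f"correct={correct:.2f}% accuracy={accuracy:.2f}% WER={wer:.2f}%")
```

Under these assumed definitions, word accuracy equals word correct only when there are no insertion errors, and the word error rate is 100% minus word accuracy, which is consistent with the paired figures in the abstract (80.28% with 19.72%, and 88.54% with 11.46%).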




Updated: 2021-05-06