Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system
Language Resources and Evaluation (IF 1.7), Pub Date: 2021-01-20, DOI: 10.1007/s10579-020-09527-z
Chuya China Bhanja, Mohammad Azharuddin Laskar, Rabul Hussain Laskar

This paper presents an automatic Indian language identification (LID) system based on tonal and non-tonal pre-classification, using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and individual languages are then identified within the respective group. For the pre-classification task, the system uses syllable-level, word-level (tri-syllable) and phrase-level (multi-word) prosody, collectively called multi-level prosody, along with spectral features, namely Mel-frequency cepstral coefficients (MFCCs), mean Hilbert envelope coefficients (MHECs), and shifted delta cepstral coefficients of the MFCCs and MHECs. Multi-level analysis of the spectral features is also proposed, and the complementarity of the syllable-, word- and phrase-level (spectral + prosody) features is examined for the pre-classification-based LID task. Four models are developed to identify the languages: a Gaussian mixture model (GMM)-universal background model (UBM), an artificial neural network (ANN), an i-vector based support vector machine (SVM) and a deep neural network (DNN). Experiments are carried out on the National Institute of Technology Silchar language database (NITS-LD) and the OGI Multi-language Telephone Speech corpus (OGI-MLTS). They confirm that both the prosody and the (spectral + prosody) features obtained at the syllable, word and phrase levels carry complementary information for the pre-classification-based LID task. At the pre-classification stage, DNN models based on multi-level (prosody + MFCC) features, coupled with a score combination technique, yield the lowest equal error rate (EER) of 9.6% on NITS-LD. On OGI-MLTS, the lowest EER of 10.2% is observed for multi-level (prosody + MHEC) features. The pre-classification module improves the performance of the baseline single-stage LID system by 3.2% and 4.2% on NITS-LD and OGI-MLTS, respectively.
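
The abstract reports EER values obtained after combining scores from the syllable-, word- and phrase-level models, but it does not state the combination rule. The following is a minimal sketch of how such an evaluation could look, assuming a simple weighted-sum fusion and a threshold-sweep estimate of the EER. The function names `fuse_scores` and `equal_error_rate`, the equal default weights and the toy scores are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fuse_scores(level_scores, weights=None):
    """Weighted-sum fusion of per-level scores.

    level_scores: one sequence of trial scores per level
                  (e.g. syllable, word, phrase), all of equal length.
    weights:      one weight per level; equal weights are assumed if omitted.
    """
    stacked = np.vstack(level_scores).astype(float)  # (n_levels, n_trials)
    if weights is None:
        weights = np.full(stacked.shape[0], 1.0 / stacked.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ stacked  # (n_trials,)


def equal_error_rate(scores, labels):
    """Estimate the EER of a binary decision (e.g. tonal vs. non-tonal).

    labels: 1 for target trials, 0 for non-target trials.
    Sweeps every observed score as a threshold and returns the mean of the
    false acceptance and false rejection rates where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = np.mean(accepted[labels == 0])   # non-targets accepted
        frr = np.mean(~accepted[labels == 1])  # targets rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy scores (made up for illustration): four trials, three levels.
labels = [1, 1, 0, 0]
syllable = [0.9, 0.4, 0.6, 0.1]
word = [0.8, 0.7, 0.3, 0.2]
phrase = [0.7, 0.9, 0.2, 0.4]

fused = fuse_scores([syllable, word, phrase])
print("syllable-only EER:", equal_error_rate(syllable, labels))
print("fused EER:", equal_error_rate(fused, labels))
```

In practice the fusion weights would be tuned on held-out data, and a finer interpolation between thresholds would give a smoother EER estimate than this sweep over observed scores.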




Updated: 2021-01-20