Comparative study of singing voice detection based on deep neural networks and ensemble learning
Human-centric Computing and Information Sciences (IF 3.9), Pub Date: 2018-11-26, DOI: 10.1186/s13673-018-0158-1
Shingchern D. You, Chien-Hung Liu, Woei-Kae Chen

This paper investigates various neural network structures and various types of stacked ensembles for singing voice detection. The models studied include convolutional neural networks (CNN), a long short-term memory (LSTM) model, a convolutional LSTM model, and a capsule network. The input features to the network models are MFCCs (mel-frequency cepstral coefficients), spectrograms from the short-time Fourier transform, or raw PCM samples. The simulation results show that the CNN model with spectrogram inputs yields higher detection accuracy, up to 91.8% on the Jamendo dataset. Among the stacked ensemble methods studied, a voting strategy performs comparably to the other methods at a much lower computational cost. By voting with five models, the accuracy reaches 94.2% on the Jamendo dataset.
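
The voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes hard majority voting over per-frame binary decisions (1 = singing, 0 = no singing) from five models; the abstract does not say whether the authors vote on hard labels or on averaged scores, so this is one plausible reading rather than their method. The function name `majority_vote` and the toy predictions are hypothetical.

```python
import numpy as np


def majority_vote(frame_predictions):
    """Fuse binary per-frame labels from several models by majority vote.

    frame_predictions: array-like of shape (n_models, n_frames) holding 0/1
    labels. Hypothetical helper for illustration, not taken from the paper.
    """
    preds = np.asarray(frame_predictions)
    if preds.ndim != 2:
        raise ValueError("expected shape (n_models, n_frames)")
    n_models = preds.shape[0]
    # Count the models that voted "singing" for each frame.
    votes = preds.sum(axis=0)
    # A frame is labeled singing when strictly more than half the models agree.
    return (2 * votes > n_models).astype(int)


# Toy predictions from five hypothetical models over six frames.
preds = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
])
fused = majority_vote(preds)
print(fused.tolist())
```

With an odd number of models, ties cannot occur, which may be one reason to vote with five. The fusion itself is a single sum and comparison per frame, consistent with the abstract's remark that voting is much cheaper than the other ensemble methods.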

Updated: 2018-11-26