A Multimodal Music Emotion Classification Method Based on Multifeature Combined Network Classifier
Mathematical Problems in Engineering, Pub Date: 2020-08-01, DOI: 10.1155/2020/4606027
Changfeng Chen, Qiang Li

To address the shortcomings of single-network classification models, this paper applies a CNN-LSTM (convolutional neural network plus long short-term memory) combined network to music emotion classification and proposes a multifeature combined network classifier based on CNN-LSTM, which feeds 2D (two-dimensional) features through the CNN-LSTM and 1D (one-dimensional) features through a DNN (deep neural network) to compensate for the deficiencies of the original single-feature models. The model uses multiple convolution kernels in the CNN for 2D feature extraction and a BiLSTM (bidirectional LSTM) for sequence processing, and it is applied separately to audio and to lyrics for single-modality emotion classification. For audio feature extraction, the music audio is finely segmented and the human voice is separated out to obtain pure background-sound clips, from which the spectrogram and LLDs (low-level descriptors) are extracted. For lyrics feature extraction, vectors selected by the chi-squared test and word embeddings extracted by Word2vec serve as the two feature representations of the lyrics. Combining these two types of heterogeneous features from audio and lyrics through the classification model improves classification performance. To fuse the emotional information of the two modalities, audio and lyrics, this paper proposes a multimodal ensemble learning method based on stacking. Unlike existing feature-level and decision-level fusion methods, it avoids the information loss caused by direct dimensionality reduction: the original features are converted into label outputs before fusion, which effectively resolves the problem of feature heterogeneity. Experiments on the Million Song Dataset show that the audio classification accuracy of the proposed multifeature combined network classifier reaches 68% and the lyrics classification accuracy reaches 74%; the average multimodal classification accuracy reaches 78%, a significant improvement over the single-modality results.
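As a concrete illustration of the two-branch architecture described above, the following is a minimal PyTorch sketch, not the authors' code: the kernel sizes (3, 5, 7), layer widths, four emotion classes, and 40-dimensional LLD vector are all assumptions for illustration, since the abstract does not state these hyperparameters.

```python
# Minimal sketch (assumed sizes): a CNN-BiLSTM branch for the 2D
# spectrogram plus a DNN branch for the 1D LLD vector, fused before
# the output layer.
import torch
import torch.nn as nn

class MultiFeatureNet(nn.Module):
    def __init__(self, n_classes=4, n_lld=40):  # both sizes are assumptions
        super().__init__()
        # Multiple convolution kernels over the spectrogram (2D branch).
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 16, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)])
        self.pool = nn.AdaptiveAvgPool2d((32, 16))  # fixed (time, freq) map
        # BiLSTM serializes the pooled CNN feature maps along the time axis.
        self.bilstm = nn.LSTM(input_size=48 * 16, hidden_size=64,
                              batch_first=True, bidirectional=True)
        # DNN branch for the 1D LLD feature vector.
        self.dnn = nn.Sequential(nn.Linear(n_lld, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU())
        self.classifier = nn.Linear(2 * 64 + 32, n_classes)

    def forward(self, spec, lld):
        # spec: (batch, 1, time_frames, freq_bins); lld: (batch, n_lld)
        maps = torch.cat([torch.relu(c(spec)) for c in self.convs], dim=1)
        maps = self.pool(maps)                      # (B, 48, 32, 16)
        seq = maps.permute(0, 2, 1, 3).flatten(2)   # (B, 32 steps, 48*16)
        _, (h, _) = self.bilstm(seq)
        audio_vec = torch.cat([h[0], h[1]], dim=1)  # forward + backward states
        return self.classifier(torch.cat([audio_vec, self.dnn(lld)], dim=1))
```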
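The stacking-based fusion can likewise be sketched as follows; the logistic-regression meta-learner and the use of out-of-fold class probabilities as meta-features are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of stacking-based multimodal fusion: the single-modal
# classifiers' label outputs, not their raw heterogeneous features,
# feed a meta-classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_by_stacking(audio_probs, lyric_probs, y_train):
    # audio_probs / lyric_probs: out-of-fold class-probability outputs
    # of the audio and lyrics base classifiers on the training set,
    # each of shape (n_samples, n_classes).
    meta_features = np.hstack([audio_probs, lyric_probs])
    meta_clf = LogisticRegression(max_iter=1000)  # assumed meta-learner
    meta_clf.fit(meta_features, y_train)
    return meta_clf
```

At test time, the two single-modal classifiers each produce a probability vector for a song, and the fitted meta-classifier predicts the final emotion label from their concatenation, so the heterogeneous audio and lyrics features are never reduced into a shared feature space directly.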
