Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel, EURASIP Journal on Audio, Speech, and Music Processing

Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2019-06-26, DOI: 10.1186/s13636-019-0155-y
Byeong-Yong Jang, Woon-Haeng Heo, Jung-Hyun Kim, Oh-Wook Kwon

We propose a new method for detecting music in broadcast content using a convolutional neural network with a Mel-scale kernel. In this detection task, music segments must be annotated in broadcast data in which music, speech, and noise are mixed. The convolutional neural network is composed of a convolutional layer with a kernel that is trained to extract robust features. The kernel size varies according to the Mel scale, and the backpropagation algorithm trains the kernel shape. We used 52 h of mixed broadcast data (25 h of music) to train the network and 24 h of collected broadcast data (music ratio of 50–76%) for testing. The test data covered various genres (drama, documentary, news, kids, reality, and so on) broadcast in British English, Spanish, and Korean. The proposed method consistently outperformed the baseline system in all three languages, with F-scores ranging from 86.5% for British data to 95.9% for Korean drama data. Our music detection system takes about 28 s to process a 1-min signal using a single 4-core CPU.
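
To illustrate how a kernel size can follow the Mel scale, here is a minimal sketch. It is not the authors' implementation: it assumes the common 2595·log10(1 + f/700) Mel formula and spaces kernel edges uniformly in Mel, in the manner of a Mel filterbank. The function names, the 16 kHz sample rate, and the 257 frequency bins are all hypothetical choices made for the example.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel (common 2595*log10 variant; an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_kernel_sizes(n_kernels, n_fft_bins, sample_rate):
    """Return a (start_bin, size) pair for each kernel.

    Kernel edges are equally spaced on the Mel axis, so kernels cover few
    spectrogram bins at low frequencies and many bins at high frequencies.
    Hypothetical helper: the paper's exact layout may differ.
    """
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_kernels + 2 edge points, uniform in Mel, from 0 Hz to Nyquist.
    mel_edges = [max_mel * i / (n_kernels + 1) for i in range(n_kernels + 2)]
    bin_edges = [round(mel_to_hz(m) / nyquist * (n_fft_bins - 1)) for m in mel_edges]
    kernels = []
    for k in range(n_kernels):
        lo, hi = bin_edges[k], bin_edges[k + 2]
        kernels.append((lo, max(1, hi - lo + 1)))
    return kernels


if __name__ == "__main__":
    for start, size in mel_kernel_sizes(8, 257, 16000):
        print(f"start bin {start:3d}, kernel size {size:3d}")
```

In a trainable version, only the kernel extents would be fixed this way; the weights inside each extent (the kernel shape) would be left to backpropagation, as the abstract describes.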

Updated: 2019-06-26
