Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4), Pub Date: 2019-06-17, DOI: 10.1186/s13636-019-0152-1
Diego de Benito-Gorron, Alicia Lozano-Diez, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez

Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies several neural-network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h) selected from the Google AudioSet dataset. These segments belong to YouTube videos and are represented as mel-spectrograms. We propose and compare two approaches. The first trains two different neural networks, one for speech detection and another for music detection. The second consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional, and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) across the three proposed tasks. Furthermore, a distractor analysis of the results identifies which events in the ontology are the most harmful to the performance of the models, revealing some difficult scenarios for the detection of music and speech.
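
To make the hybrid architecture concrete, below is a minimal PyTorch sketch of a convolutional-LSTM detector of the kind the abstract describes. The layer counts, filter sizes, hidden dimension, and the 64-band, 500-frame input shape are illustrative assumptions, not the authors' published configuration.

    # Minimal sketch of a hybrid convolutional-LSTM speech/music detector.
    # All hyperparameters here are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class ConvLSTMDetector(nn.Module):
        """Segment-level detector over mel-spectrogram input.

        Input:  (batch, 1, n_mels, n_frames) mel-spectrogram of a 10 s segment.
        Output: (batch, n_classes) class posteriors.
        """
        def __init__(self, n_mels: int = 64, n_classes: int = 2):
            super().__init__()
            # Convolutional front end: learns local time-frequency patterns.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d((2, 2)),   # halve both mel and time axes
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d((2, 2)),
            )
            # Recurrent stage: models temporal dependencies across frames.
            self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                                hidden_size=64, batch_first=True)
            # Multi-task variant: n_classes = 2 with independent sigmoids
            # (speech, music); a single-task detector would use n_classes = 1.
            self.out = nn.Linear(64, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.conv(x)                                 # (B, C, n_mels/4, T/4)
            b, c, f, t = h.shape
            h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major sequence
            h, _ = self.lstm(h)
            return torch.sigmoid(self.out(h[:, -1]))         # last-step summary

    # Example: a batch of four 10-second segments, 64 mel bands x 500 frames.
    scores = ConvLSTMDetector()(torch.randn(4, 1, 64, 500))
    print(scores.shape)  # torch.Size([4, 2])

Independent sigmoid outputs fit the multi-task setting described in the abstract, where speech and music can be active in the same segment; a softmax would incorrectly force the two labels to compete.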

Updated: 2019-06-17