A large TV dataset for speech and music activity detection
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2022-09-03, DOI: 10.1186/s13636-022-00253-8
Yun-Ning Hung, Chih-Wei Wu, Iroro Orife, Aaron Hipple, William Wolcott, Alexander Lerch

Automatic speech and music activity detection (SMAD) is an enabling task that can help segment, index, and pre-process audio content in radio broadcasts and TV programs. However, due to copyright concerns and the cost of manual annotation, the limited availability of diverse and sizeable datasets hinders the progress of state-of-the-art (SOTA) data-driven approaches. We address this challenge by presenting a large-scale dataset containing Mel spectrogram, VGGish, and MFCC features extracted from around 1600 hours of professionally produced audio tracks, together with noisy labels indicating the approximate locations of speech and music segments. The labels are derived from several sources, such as subtitles and cue sheets. A test set curated by human annotators is also included as a subset for evaluation. To validate the generalizability of the proposed dataset, we conduct several experiments comparing various model architectures and their variants under different conditions. The results suggest that our proposed dataset can serve as a reliable training resource and leads to SOTA performance on various public datasets. To the best of our knowledge, this is the first large-scale, open-sourced dataset that contains features extracted from professionally produced audio tracks along with their corresponding frame-level speech and music annotations.
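The abstract mentions that the dataset distributes frame-level Mel spectrogram features rather than raw audio. As a rough illustration of what such features look like, the sketch below computes a log-Mel spectrogram from scratch with NumPy. All parameters (16 kHz sample rate, 512-point FFT, 256-sample hop, 64 Mel bands) are illustrative assumptions, not the paper's actual extraction configuration.

```python
# Minimal log-Mel spectrogram sketch. Sample rate, FFT size, hop length,
# and number of Mel bands are hypothetical, chosen only for illustration.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=64):
    # Triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto the Mel filterbank, and compress with a log.
    n_frames = 1 + (len(y) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * win for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)

# Toy usage: one second of a 440 Hz tone yields a (frames x Mel-bands) matrix,
# i.e. one feature vector per frame, matching the frame-level label granularity.
sr = 16000
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
S = log_mel_spectrogram(y, sr)
print(S.shape)  # (61, 64)
```

Frame-level annotations pair naturally with such a representation: each row of the matrix corresponds to one time frame, to which a speech and/or music label can be attached.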
