Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4). Pub Date: 2021-02-03, DOI: 10.1186/s13636-020-00194-0
Rajat Hebbar, Pavlos Papadopoulos, Ramon Reyes, Alexander F. Danvers, Angelina J. Polsinelli, Suzanne A. Moseley, David A. Sbarra, Matthias R. Mehl, Shrikanth Narayanan

In recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to the availability of large open-source datasets and increased computational resources. However, a shortcoming of these methods is that they often fail to generalize well to real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Several interfering factors, such as dynamically varying environmental conditions including background speakers, TV, or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using coarse annotations available only at a lower time resolution. We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
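The abstract gives no implementation details, but the core MIL setup it describes can be illustrated with a small sketch: each coarsely labeled segment is treated as a bag of frame-level instances (e.g., speech activity detection embeddings), a pooling layer aggregates per-instance scores into a bag-level prediction trained against the coarse label, and the resulting instance scores localize foreground speech within the segment. The names below (AttentionMILPooling, feat_dim, the 128-dimensional embedding size) are illustrative assumptions, not the authors' implementation, and attention pooling stands in for the family of pooling methods the paper compares.

```python
# Minimal sketch of deep multiple instance learning for coarsely labeled audio.
# Assumptions (not from the paper): module/variable names, embedding size, and
# the use of attention pooling as the aggregation function.
import torch
import torch.nn as nn


class AttentionMILPooling(nn.Module):
    """Aggregates frame-level (instance) scores into a segment-level (bag) score."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)   # per-frame foreground score
        self.attention = nn.Sequential(                 # per-frame attention weight
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, bag: torch.Tensor):
        # bag: (num_frames, feat_dim) -- one coarsely labeled audio segment
        scores = torch.sigmoid(self.instance_scorer(bag)).squeeze(-1)    # instance-level
        weights = torch.softmax(self.attention(bag).squeeze(-1), dim=0)  # attention over frames
        bag_score = (weights * scores).sum()                             # bag-level prediction
        return bag_score, scores


if __name__ == "__main__":
    feat_dim = 128                        # assumed embedding dimensionality
    pooling = AttentionMILPooling(feat_dim)
    segment = torch.randn(100, feat_dim)  # 100 frames of one labeled segment
    bag_score, frame_scores = pooling(segment)
    # Train with the coarse (bag-level) label only; frame_scores then provide
    # the instance-level localization of foreground speech.
    loss = nn.functional.binary_cross_entropy(bag_score, torch.tensor(1.0))
    loss.backward()
```

Because the bag score is a convex combination of per-frame sigmoid scores, training against coarse labels still yields calibrated frame-level outputs, which is what makes instance-level localization possible without fine-grained annotation.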

Updated: 2021-02-04