Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain
arXiv - CS - Computation and Language. Pub Date: 2021-02-23. DOI: arXiv-2102.11588
Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
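The fusion described above can be illustrated with a minimal sketch: audio and video trackers each produce a log-probability map over a discretized localization grid, and per-cell dynamic stream weights in [0, 1] (in the paper, predicted by a neural network from time- and location-dependent reliability cues) blend the two log-linearly. The function name, grid shape, and toy weights below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_spatial_dsw(log_p_audio, log_p_video, weights):
    """Log-linear fusion of audio and video localization maps.

    log_p_audio, log_p_video: (H, W) log-probability maps over a
    discretized localization grid for one time frame.
    weights: (H, W) per-cell dynamic stream weights in [0, 1];
    1.0 trusts the audio stream fully, 0.0 the video stream.
    (Illustrative sketch; the paper learns these weights with a
    neural network rather than fixing them by hand.)
    """
    log_p = weights * log_p_audio + (1.0 - weights) * log_p_video
    # Renormalize so the fused map is a valid distribution over the grid.
    log_p -= np.log(np.exp(log_p).sum())
    return log_p

# Toy example on a 1x3 grid: the audio tracker favors the leftmost
# cell, the video tracker the rightmost, and the weights encode that
# video is more reliable on the right (e.g. good lighting there).
la = np.log(np.array([[0.6, 0.3, 0.1]]))
lv = np.log(np.array([[0.1, 0.2, 0.7]]))
w = np.array([[0.9, 0.5, 0.1]])
fused = fuse_spatial_dsw(la, lv, w)
est = np.unravel_index(np.argmax(fused), fused.shape)
```

With these weights the fused map peaks in the rightmost cell, since the locally more reliable video stream dominates there; region-dependent weighting is exactly what distinguishes this approach from a single global stream weight.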

Updated: 2021-02-24