Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling Factor Estimation, IEEE Signal Processing Letters


Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling Factor Estimation
IEEE Signal Processing Letters (IF 3.2), Pub Date: 2021-06-28, DOI: 10.1109/lsp.2021.3092959
Xinyuan Qian, Qi Liu, Jiadong Wang, Haizhou Li

Neither a monocular RGB camera nor a small-size microphone array is capable of accurate three-dimensional (3D) speaker localization. By taking advantage of accurate visual object detection and complementary audio-visual sensor fusion, we formulate 3D speaker localization as a visual scaling factor estimation problem. As a result, we reduce traditional audio-only 3D speaker localization from an exhaustive grid search to a one-dimensional (1D) optimization problem. We propose a multi-modal perception system with two optimization approaches. Indicative empirical results on a real dataset show that the proposed methods are effective, accurate, and robust against interference, and that they are competitive with conventional uni-modal and state-of-the-art audio-visual speaker localization approaches.
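The reduction to a 1D problem can be illustrated with a toy sketch. This is not the authors' method: it assumes a hypothetical setup in which a visual detector supplies a unit bearing ray from a camera at the origin, the microphone positions are known in the camera frame, and noiseless time differences of arrival (TDOAs) are available. The speaker position is then `s * ray`, and only the scalar `s` has to be searched. All names, the microphone layout, and the speed-of-sound constant are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def tdoas(position, mics):
    """TDOAs in seconds of each microphone relative to microphone 0."""
    dists = np.linalg.norm(mics - position, axis=1)
    return (dists[1:] - dists[0]) / SPEED_OF_SOUND


def estimate_scale(ray, mics, observed, s_min=0.3, s_max=6.0, steps=2000):
    """Grid-search the scale s so that s * ray best explains the observed TDOAs."""
    ray = ray / np.linalg.norm(ray)
    candidates = np.linspace(s_min, s_max, steps)
    errors = [np.sum((tdoas(s * ray, mics) - observed) ** 2) for s in candidates]
    return float(candidates[int(np.argmin(errors))])


# Hypothetical 4-microphone array, offset 0.2 m from the camera along x.
mics = np.array([
    [0.30, 0.00, 0.0],
    [0.10, 0.00, 0.0],
    [0.20, 0.10, 0.0],
    [0.20, -0.10, 0.0],
])

# Bearing ray from the (assumed) visual detection, and the true distance in metres.
ray = np.array([0.3, 0.2, 1.0])
ray = ray / np.linalg.norm(ray)
true_scale = 2.5

# Simulate noiseless audio observations, then recover the scale from them.
observed = tdoas(true_scale * ray, mics)
estimate = estimate_scale(ray, mics, observed)
print(f"estimated scale: {estimate:.3f} m")
```

The search runs over a single variable instead of a 3D grid of candidate positions, which is the kind of saving the abstract describes. With real, noisy audio the error curve along the ray would be much flatter, so a practical system would need a more robust acoustic cost than this squared TDOA mismatch.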


Updated: 2021-06-28
