Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
arXiv - CS - Sound Pub Date : 2021-07-14 , DOI: arxiv-2107.06592
Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on the accurate interpretation of short-term and long-term audio and visual information, as well as on audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5\% and 2.2\% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code has been made available at: \textcolor{magenta}{\url{https://github.com/TaoRuijie/TalkNet_ASD}}.
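The abstract's pipeline (cross-attention for inter-modality interaction, then self-attention over the whole sequence for long-term speaking evidence) can be sketched in miniature. This is a hedged, illustrative NumPy sketch, not the authors' implementation: the function names (`attention`, `asd_scores`), the feature dimensions, and the random projection standing in for the learned classifier head are all assumptions; the temporal encoders that would produce the frame-level embeddings are omitted.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention across time steps."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def asd_scores(audio, visual):
    """Toy TalkNet-style scoring.
    audio, visual: (T, d) frame-level embeddings, assumed to come
    from audio/visual temporal encoders (not shown here).
    Returns a per-frame speaking probability of shape (T,)."""
    # Cross-attention: each modality queries the other, modelling
    # audio-visual interaction.
    a2v = attention(audio, visual, visual)   # audio attends to video
    v2a = attention(visual, audio, audio)    # video attends to audio
    fused = np.concatenate([a2v, v2a], axis=-1)      # (T, 2d)
    # Self-attention over all T frames captures long-term evidence,
    # rather than deciding from each frame in isolation.
    context = attention(fused, fused, fused)
    # Hypothetical classifier head: a fixed random projection + sigmoid
    # stands in for the learned per-frame speaking classifier.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(context.shape[-1])
    return 1.0 / (1.0 + np.exp(-(context @ w)))

# Example: 25 frames (~1 s of video at 25 fps) with 16-dim features.
rng = np.random.default_rng(1)
T, d = 25, 16
scores = asd_scores(rng.standard_normal((T, d)),
                    rng.standard_normal((T, d)))
print(scores.shape)  # (25,)
```

Note how the long-term context enters only through the final self-attention step: every frame's score can depend on evidence from the entire clip, which is the key difference from purely instantaneous, short-term systems.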

Updated: 2021-07-15