The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

arXiv - EE - Audio and Speech Processing. Pub Date: 2023-03-04, DOI: arxiv-2303.02348
Haoxu Wang, Ming Cheng, Qiang Fu, Ming Li

This paper further explores our previous wake word spotting system, which ranked 2nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) to improve performance. Second, we explore different combinations of data augmentation methods. Finally, we study fusion strategies, including score-level, cascaded, and neural fusion. Our proposed multimodal system leverages multimodal features and uses complementary visual information to mitigate the performance degradation of audio-only systems in complex acoustic scenarios. Our system obtains a false reject rate of 2.15% and a false alarm rate of 3.44% on the evaluation set of the competition database, a new state-of-the-art result with a 21% relative improvement over previous systems.
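As a rough illustration of two terms used in the abstract, the sketch below shows one common form of score-level fusion (a weighted average of per-utterance audio and visual scores) and how false reject and false alarm rates are computed at a fixed threshold. This is not the authors' implementation: the function names `fuse_scores` and `frr_far`, the `audio_weight` parameter, and the choice of a weighted average are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple


def fuse_scores(audio: Sequence[float], video: Sequence[float],
                audio_weight: float = 0.5) -> List[float]:
    """Hypothetical score-level fusion: weighted average of two score lists."""
    if len(audio) != len(video):
        raise ValueError("audio and video must have one score per utterance")
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio, video)]


def frr_far(scores: Sequence[float], labels: Sequence[int],
            threshold: float) -> Tuple[float, float]:
    """Return (false reject rate, false alarm rate) at the given threshold.

    labels: 1 = wake word present, 0 = wake word absent.
    An utterance is accepted when its score is >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    # False reject: wake word present but score falls below the threshold.
    false_rejects = sum(1 for s, y in zip(scores, labels)
                        if y == 1 and s < threshold)
    # False alarm: wake word absent but score reaches the threshold.
    false_alarms = sum(1 for s, y in zip(scores, labels)
                       if y == 0 and s >= threshold)
    return false_rejects / positives, false_alarms / negatives
```

In practice the fusion weight and decision threshold would be tuned on a development set; the values used with this sketch are arbitrary.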


Updated: 2023-03-07
