DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-07-14, DOI: arXiv-2007.06809
Ehsan Asali, Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, Prasanth Sengadu Suresh, and Hamid R. Arabnia

For recognizing speakers in video streams, significant research has been conducted on building rich machine learning models that extract high-level speaker features such as facial expression, emotion, and gender. However, such a model cannot be obtained with single-modality feature extractors that exploit only the audio signals or only the image frames extracted from a video stream. In this paper, we address this problem from a different perspective and propose an unprecedented multimodal data fusion framework called DeepMSRF (Deep Multimodal Speaker Recognition with Feature selection). DeepMSRF is fed features from two modalities, namely the speakers' audio and face images, and uses a two-stream VGGNet trained on both modalities to obtain a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF to a subset of the VoxCeleb2 dataset with its metadata merged with the VGGFace2 dataset. The goal of DeepMSRF is first to identify the speaker's gender and then to recognize his or her name for any given video stream. The experimental results show that DeepMSRF outperforms single-modality speaker recognition methods by at least 3 percent in accuracy.
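The two-stream idea in the abstract — embed each modality separately, fuse the embeddings, then classify the speaker — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the feature dimensions, the mean-pooling stand-ins for the two VGG streams, and the linear classifier are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's VGG-based extractors produce different dimensions.
AUDIO_DIM, FACE_DIM, N_SPEAKERS = 128, 256, 10

def extract_audio_features(audio_frames):
    # Stand-in for the audio VGG stream: any fixed-size embedding works here.
    return audio_frames.mean(axis=0)       # shape (AUDIO_DIM,)

def extract_face_features(face_crops):
    # Stand-in for the visual VGG stream.
    return face_crops.mean(axis=0)         # shape (FACE_DIM,)

def fuse(audio_feat, face_feat):
    # Late fusion by concatenation: the core of a two-stream multimodal model.
    return np.concatenate([audio_feat, face_feat])

# Toy inputs standing in for one video clip: 50 audio frames, 5 face crops.
audio = rng.normal(size=(50, AUDIO_DIM))
faces = rng.normal(size=(5, FACE_DIM))

fused = fuse(extract_audio_features(audio), extract_face_features(faces))
assert fused.shape == (AUDIO_DIM + FACE_DIM,)

# A (randomly initialized, hence untrained) linear classifier over the fused
# vector maps it to a speaker identity; in the paper this stage is learned.
W = rng.normal(size=(N_SPEAKERS, AUDIO_DIM + FACE_DIM))
pred = int(np.argmax(W @ fused))
```

The fusion step is where the single-modality limitation described above is overcome: the classifier sees audio and visual evidence jointly rather than either stream alone.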

Updated: 2020-07-22