Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach
IEEE Transactions on Cybernetics (IF 9.4), Pub Date: 2022-11-24, DOI: 10.1109/tcyb.2022.3220040
Xiaoting Wu, Xueyi Zhang, Xiaoyi Feng, Miguel Bordallo Lopez, Li Liu

Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to its potential practical applications. Over the past decade, many efforts have been devoted to improving verification performance from human faces alone, without other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, consisting of familial talking facial videos recorded under various scenarios. Based on this dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML), which consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem, and that the proposed fusion method outperforms baseline methods. In addition, we evaluate human verification ability on a subset of TALKIN-Family: humans achieve higher accuracy when they have access to both faces and voices, yet machine-learning methods can effectively and efficiently outperform them. Finally, we outline future work and research opportunities with the TALKIN-Family dataset.
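To make the fusion idea concrete, the following is a minimal, illustrative sketch of attention-weighted fusion of face and voice embeddings followed by cosine-similarity scoring. It is an assumption-laden toy, not the paper's UAAML implementation: the attention vector `w_att`, the embedding dimension, and the similarity-based decision are all placeholders standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(face_emb, voice_emb, w_att):
    # Stack the two modality embeddings and score each with an
    # attention vector (w_att is a stand-in for learned parameters).
    feats = np.stack([face_emb, voice_emb])   # shape (2, d)
    scores = feats @ w_att                    # one score per modality
    alpha = softmax(scores)                   # modality weights, sum to 1
    return alpha @ feats                      # (d,) fused feature

def cosine(a, b):
    # Cosine similarity used here as a stand-in verification score.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for a candidate parent-child pair (random placeholders).
d = 8
w_att = rng.normal(size=d)
face_p, voice_p = rng.normal(size=d), rng.normal(size=d)
face_c, voice_c = rng.normal(size=d), rng.normal(size=d)

fused_parent = attention_fuse(face_p, voice_p, w_att)
fused_child = attention_fuse(face_c, voice_c, w_att)
score = cosine(fused_parent, fused_child)
print(round(score, 3))
```

In the actual UAAML method the fused representation is additionally shaped by an adversarial network on top of unified multimodal features; this sketch only shows why an attention module lets the model weight the face and voice modalities adaptively per pair.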

Updated: 2024-08-26