Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval
IEEE Journal of Selected Topics in Signal Processing (IF 7.5) Pub Date: 2020-03-01, DOI: 10.1109/jstsp.2020.2987720
Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang

This paper proposes a new strategy for learning effective cross-modal joint embeddings using self-supervision. We set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant data in one domain given input in another. The method builds on recent advances in learning representations from cross-modal self-supervision using contrastive or binary cross-entropy loss functions. To investigate the robustness of the proposed learning strategy across multi-modal applications, we perform experiments on two applications – audio-visual synchronisation and cross-modal biometrics. The audio-visual synchronisation task requires temporal correspondence between modalities to obtain joint representations of phonemes and visemes, while the cross-modal biometrics task requires common speaker representations given face images and audio tracks. Experiments show that the performance of systems trained using the proposed method far exceeds that of existing methods on both tasks, whilst allowing significantly faster training.
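The cross-modal retrieval setup described above can be illustrated with a toy sketch: given an embedding from one modality, score a set of candidate embeddings from the other modality and treat matching as a softmax classification over similarities. This is a minimal illustration of the general idea, not the paper's implementation; the function name, the use of Euclidean distance as the (negative) similarity, and all dimensions are assumptions for the example.

```python
import numpy as np

def multiway_matching_loss(query, candidates, target_idx):
    """Toy cross-modal matching loss (illustrative sketch, not the paper's code).

    query:      (d,) embedding from one modality (e.g. a video segment).
    candidates: (N, d) candidate embeddings from the other modality,
                exactly one of which (at target_idx) is the true match.
    Returns the cross-entropy of a softmax over negative Euclidean distances,
    so closer candidates receive higher matching probability.
    """
    dists = np.linalg.norm(candidates - query, axis=1)
    logits = -dists
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])

# Toy usage: plant the true match close to the query embedding.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
candidates = rng.normal(size=(4, 8))
candidates[2] = query + 0.05 * rng.normal(size=8)  # true match at index 2
loss = multiway_matching_loss(query, candidates, target_idx=2)
print(f"matching loss: {loss:.4f}")
```

Minimising such a loss pulls matching pairs together and pushes non-matching candidates apart in the joint embedding space, which is the behaviour a retrieval objective of this kind is meant to induce.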

Updated: 2020-03-01