Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space, arXiv - CS - Sound


Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space
arXiv - CS - Sound. Pub Date: 2020-08-22. DOI: arxiv-2009.05103
Sicheng Zhao, Yaxian Li, Xingxu Yao, Weizhi Nie, Pengfei Xu, Jufeng Yang, Kurt Keutzer

Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music that carry similar emotions may make emotion perception more vivid and intense. Existing emotion-based image and music matching methods either employ limited categorical emotion states, which cannot adequately reflect the complexity and subtlety of emotions, or train the matching model using an impractical multi-stage pipeline. In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and the task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. Extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching compared with state-of-the-art approaches.
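As a rough illustration of what matching in a continuous VA space can look like, the Python sketch below ranks music clips by how close their (valence, arousal) labels lie to those of an image. It is not CDCML, which learns a shared embedding with deep networks. The clip names, the VA values, the helper names, and the exponential similarity score are all hypothetical choices made for this example.

```python
import math

def va_distance(a, b):
    """Euclidean distance between two (valence, arousal) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def va_similarity(a, b, scale=1.0):
    """Map distance to a score in (0, 1].

    The exponential form and the `scale` parameter are assumptions
    for illustration, not the similarity defined in the paper.
    """
    return math.exp(-va_distance(a, b) / scale)

def rank_music_for_image(image_va, music_vas):
    """Return music ids sorted from closest to farthest in VA space."""
    return sorted(music_vas, key=lambda mid: va_distance(image_va, music_vas[mid]))

# Hypothetical VA annotations (valence, arousal), each roughly in [-1, 1].
image_va = (0.8, 0.6)
music_vas = {
    "calm_piano": (0.6, -0.5),
    "upbeat_pop": (0.7, 0.7),
    "dark_ambient": (-0.6, -0.3),
}

ranking = rank_music_for_image(image_va, music_vas)
print(ranking)
```

With continuous labels, every image-music pair receives a graded score rather than a same-category or different-category decision, which is the limitation of categorical emotion states that the abstract points to.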


Updated: 2020-09-14
