Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation

Yi, Jing; Zhu, Yaochen; Xie, Jiayi; Chen, Zhenzhong

doi:10.1109/TMM.2021.3128254

arXiv:2107.07268 (cs)

[Submitted on 15 Jul 2021 (v1), last revised 11 Dec 2022 (this version, v2)]

Abstract: In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the two embeddings of a matched video-music pair are aligned by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle: the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimates, so that the modality with the lower noise level is given more weight. As a result, the micro-video latent variables contain less irrelevant information, which leads to more robust generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE, based on visualizing some recommendation results, is also included.
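
The variance-based weighting mentioned in the abstract can be illustrated with the standard product of Gaussian experts, in which each expert contributes in proportion to its precision (inverse variance). The sketch below is a generic one-dimensional illustration, not the authors' implementation: the function name `poe_fuse` and the numeric values are hypothetical, and it assumes each modality outputs a Gaussian mean and variance for a latent dimension, with no prior expert included.

```python
from typing import Sequence, Tuple


def poe_fuse(means: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Fuse 1-D Gaussian experts N(mean_i, var_i) by multiplying their densities.

    The product of Gaussians is again Gaussian: its precision is the sum of the
    experts' precisions, and its mean is the precision-weighted average.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need the same non-zero number of means and variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    fused_var = 1.0 / total_precision
    return fused_mean, fused_var


# Hypothetical estimates for one latent dimension of one micro-video.
visual_mean, visual_var = 1.0, 0.25   # lower variance: less noisy expert
textual_mean, textual_var = 3.0, 1.0  # higher variance: noisier expert

fused_mean, fused_var = poe_fuse(
    [visual_mean, textual_mean], [visual_var, textual_var]
)
print(fused_mean, fused_var)
```

With these illustrative numbers the fused mean lies closer to the visual estimate than to the textual one, and the fused variance is smaller than either input variance, which is the behavior the abstract describes as giving more weight to the modality with the lower noise level.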

Subjects:	Multimedia (cs.MM); Information Retrieval (cs.IR)
Cite as:	arXiv:2107.07268 [cs.MM]
	(or arXiv:2107.07268v2 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2107.07268
Related DOI:	https://doi.org/10.1109/TMM.2021.3128254

Submission history

From: Jing Yi
[v1] Thu, 15 Jul 2021 11:47:43 UTC (3,266 KB)
[v2] Sun, 11 Dec 2022 15:07:42 UTC (1,474 KB)
