Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation


arXiv - CS - Information Retrieval, Pub Date: 2021-07-15, DOI: arxiv-2107.07268
Jing Yi, Yaochen Zhu, Jiayi Xie, Zhenzhong Chen

In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the embeddings of a matched video-music pair are aligned through cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle: the semantic information in the visual and textual modalities of the micro-video is weighted according to the variance estimated for each modality, so that the modality with the lower noise level is given more weight. As a result, the micro-video latent variables contain less irrelevant information, which leads to more robust model generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE, based on visualizing some recommendation results, is also included.
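The variance-weighted fusion described in the abstract can be illustrated with the standard product of diagonal Gaussian experts, where each expert contributes in proportion to its precision (the inverse of its variance). The sketch below is a minimal illustration under that assumption and is not the authors' implementation: the function name `product_of_experts`, the two-dimensional toy latent space, and the example numbers are all hypothetical, and whether CMVAE also multiplies in a prior expert is not stated in the abstract.

```python
from typing import List, Sequence, Tuple


def product_of_experts(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse diagonal Gaussian experts by precision weighting (hypothetical helper).

    means[i][d] and variances[i][d] describe expert i along latent dimension d.
    The fused precision is the sum of the expert precisions, and the fused mean
    is the precision-weighted average of the expert means.
    """
    dim = len(means[0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for d in range(dim):
        # Precision = 1 / variance; a low-noise expert has high precision.
        precisions = [1.0 / var[d] for var in variances]
        total_precision = sum(precisions)
        weighted_sum = sum(p * mu[d] for p, mu in zip(precisions, means))
        fused_mean.append(weighted_sum / total_precision)
        fused_var.append(1.0 / total_precision)
    return fused_mean, fused_var


# Toy example (assumed values): the visual expert is confident in dimension 0,
# the textual expert is more confident in dimension 1.
visual_mu, visual_var = [1.0, 0.0], [0.1, 1.0]
textual_mu, textual_var = [0.0, 2.0], [1.0, 0.5]

mu, var = product_of_experts([visual_mu, textual_mu], [visual_var, textual_var])
print("fused mean:", mu)
print("fused variance:", var)
```

In each dimension the fused mean is pulled toward the expert with the smaller variance, and the fused variance is smaller than either input variance, which matches the abstract's statement that the modality with the lower noise level is given more weight.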


Updated: 2021-07-16
