Learning to Visualize Music Through Shot Sequence for Automatic Concert Video Mashup
IEEE Transactions on Multimedia (IF 8.4), Pub Date: 2020-06-22, DOI: 10.1109/tmm.2020.3003631
Wen-Li Wei, Jen-Chun Lin, Tyng-Luh Liu, Hsiao-Rong Tyan, Hsin-Min Wang, Hong-Yuan Mark Liao

An experienced director usually switches among different types of shots to make visual storytelling more touching. When filming a musical performance, appropriate shot switching can produce special effects, such as enhancing the expression of emotion or heating up the atmosphere. However, while this visual storytelling technique is often used in professional recordings of a live concert, amateur recordings made by audience members typically lack such storytelling concepts and skills when filming the same event. Thus, a versatile system that can perform video mashup to create a refined, high-quality video from such amateur clips is desirable. To this end, we aim to translate the music into an attractive shot (type) sequence by learning the relation between music and the visual storytelling of shots. The resulting shot sequence can then be used to better portray the visual storytelling of a song and to guide the concert video mashup process. To achieve this task, we first introduce a novel probabilistic fusion approach, named multi-resolution fused recurrent neural networks (MF-RNNs) with film-language, which integrates multi-resolution fused RNNs and a film-language model to boost translation performance. We then distill the knowledge in MF-RNNs with film-language into a lightweight RNN, which is more efficient and easier to deploy. Results from objective and subjective experiments demonstrate that both MF-RNNs with film-language and the lightweight RNN can generate attractive shot sequences for music, thereby enhancing the viewing and listening experience.
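To make the two modeling ideas in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' released code): two GRU branches operating on audio features at different temporal resolutions whose shot-type probabilities are averaged as a simple stand-in for the paper's probabilistic fusion, plus a standard soft-target distillation loss for training a smaller student RNN. The feature dimension, number of shot types, pooling factor, and all class/function names are illustrative assumptions, and the film-language model (a shot-transition prior) is omitted for brevity.

```python
# Hypothetical sketch of multi-resolution RNN fusion and distillation for
# music-to-shot-sequence translation; shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SHOT_TYPES = 5  # assumed shot-type vocabulary (e.g. close-up, medium, long, ...)

class ShotTypeRNN(nn.Module):
    """One GRU branch: maps a sequence of audio features to per-frame shot-type logits."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_SHOT_TYPES)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h)                    # logits: (batch, time, classes)

class MultiResolutionFusion(nn.Module):
    """Teacher-style model: a fine branch sees frame-level features, a coarse branch
    sees temporally pooled features; their class probabilities are averaged
    (a simple stand-in for the paper's probabilistic fusion)."""
    def __init__(self, feat_dim, hidden_dim=128, pool=4):
        super().__init__()
        self.fine = ShotTypeRNN(feat_dim, hidden_dim)
        self.coarse = ShotTypeRNN(feat_dim, hidden_dim)
        self.pool = pool

    def forward(self, x):
        p_fine = F.softmax(self.fine(x), dim=-1)
        # Coarse branch: average-pool along time, run the RNN, then upsample
        # back to the original frame rate so the two branches align.
        xc = F.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
        p_coarse = F.softmax(self.coarse(xc), dim=-1)
        p_coarse = F.interpolate(p_coarse.transpose(1, 2), size=x.size(1),
                                 mode="nearest").transpose(1, 2)
        return 0.5 * (p_fine + p_coarse)       # fused shot-type probabilities

def distillation_loss(student_logits, teacher_probs, labels, T=2.0, alpha=0.5):
    """KL between the student's tempered log-probabilities and the teacher's fused
    probabilities, plus cross-entropy against ground-truth shot labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    teacher_probs, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits.reshape(-1, NUM_SHOT_TYPES),
                           labels.reshape(-1))
    return alpha * soft + (1 - alpha) * hard

if __name__ == "__main__":
    feats = torch.randn(2, 64, 40)              # e.g. 40-d log-mel frames (assumed)
    labels = torch.randint(0, NUM_SHOT_TYPES, (2, 64))
    teacher = MultiResolutionFusion(feat_dim=40)
    student = ShotTypeRNN(feat_dim=40, hidden_dim=32)   # lightweight student RNN
    with torch.no_grad():
        teacher_probs = teacher(feats)
    loss = distillation_loss(student(feats), teacher_probs, labels)
    print(loss.item())
```

At inference time, one would decode the per-frame probabilities into a shot-type sequence (e.g. by argmax or, closer in spirit to the paper, by re-scoring with a film-language transition model) and use that sequence to select among the available amateur clips during mashup.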

Updated: 2020-06-22