Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6). Pub Date: 2022-11-07. DOI: 10.1109/tpami.2021.3132229
Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun

Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. A natural choice for this task is a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, then use a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, yielding visually irrelevant target captions; and 2) errors in the generated pivot captions propagate to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns the source visual and target language domains to inject source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model to the decoder of the pivot-to-target model, allowing end-to-end inference that completely skips the generation of pivot captions. To enhance VIM's cross-modality injection, UVC-VI further introduces a pluggable video encoder, the Multimodal Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE achieves 4% and 7% relative improvements in CIDEr score over current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
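To make the contrast in the abstract concrete, the following minimal PyTorch sketch juxtaposes the two-step pipeline with UVC-VI's end-to-end inference path. All class names and submodule interfaces (PipelineSystem, UVCVI, and the components passed to them) are hypothetical illustrations based solely on the abstract's description, not the authors' released code.

```python
import torch.nn as nn

class PipelineSystem(nn.Module):
    """Two-step baseline: video -> pivot caption -> target caption.
    The translation model never sees the visual features, and errors
    in the generated pivot caption propagate into the target caption."""
    def __init__(self, video_encoder, pivot_decoder, translation_model):
        super().__init__()
        self.video_encoder = video_encoder          # video -> visual features
        self.pivot_decoder = pivot_decoder          # visual features -> pivot-language tokens
        self.translation_model = translation_model  # pivot tokens -> target-language tokens

    def forward(self, video):
        visual = self.video_encoder(video)
        pivot_caption = self.pivot_decoder(visual)    # discrete bottleneck
        return self.translation_model(pivot_caption)  # visual features are lost here

class UVCVI(nn.Module):
    """End-to-end path: the Visual Injection Module (VIM) maps the video
    encoder's output into the target-language decoder's input space, so
    pivot-caption generation is skipped entirely at inference time."""
    def __init__(self, video_encoder, vim, target_decoder):
        super().__init__()
        self.video_encoder = video_encoder    # pluggable, e.g. the Multimodal Collaborative Encoder (MCE)
        self.vim = vim                        # aligns visual features with the target language domain
        self.target_decoder = target_decoder  # decoder of the pivot-to-target translation model

    def forward(self, video):
        visual = self.video_encoder(video)
        injected = self.vim(visual)           # cross-modality injection
        return self.target_decoder(injected)  # target caption; no pivot captions generated
```

The design point visible in the sketch is that UVC-VI removes the discrete pivot-caption bottleneck: the target decoder conditions directly on injected visual features, which is why visual information can reach the target captions and pivot-caption errors cannot propagate.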

Updated: 2021-12-02