A text-based visual context modulation neural model for multimodal machine translation
Pattern Recognition Letters (IF 5.1) Pub Date: 2020-06-15, DOI: 10.1016/j.patrec.2020.06.010
Soonmo Kwon, Byung-Hyun Go, Jong-Hyeok Lee

We introduce a novel multimodal machine translation model that integrates image features modulated by the corresponding caption. In general, an image contains far more information than its description alone. Furthermore, in multimodal machine translation, feature maps are commonly extracted from a network pre-trained for object recognition, so it is not appropriate to use these feature maps directly. To extract the visual features associated with the text, we design a modulation network that combines textual information from the encoder with visual information from the pre-trained CNN. However, because multimodal translation data is scarce, an overly complicated model could perform poorly. For simplicity, we apply a feature-wise multiplicative transformation. Our model is thus a modular trainable network that can be embedded in the architecture of existing multimodal translation models. We verified our model by conducting experiments with the Transformer on the Multi30k dataset and evaluating translation quality with the BLEU and METEOR metrics. Overall, our model improved over a text-based model and other existing models.
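To make the feature-wise multiplicative transformation concrete, below is a minimal PyTorch sketch of one way such text-conditioned modulation of CNN features could be wired up. The module name, the mean-pooling of encoder states, the sigmoid gating, and the dimensions (512-d text states, 2048-channel CNN feature maps) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TextConditionedModulation(nn.Module):
    """Feature-wise multiplicative modulation of CNN feature maps,
    conditioned on a pooled text representation from the translation encoder.
    Names and sizes are illustrative, not taken from the paper."""

    def __init__(self, text_dim: int, visual_channels: int):
        super().__init__()
        # Project the pooled text summary to one scale per visual channel.
        self.to_scale = nn.Linear(text_dim, visual_channels)

    def forward(self, text_states: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, src_len, text_dim) from the text encoder
        # visual_feats: (batch, channels, H, W) from a pre-trained CNN
        text_summary = text_states.mean(dim=1)              # (batch, text_dim)
        scale = torch.sigmoid(self.to_scale(text_summary))  # (batch, channels)
        # Broadcast the per-channel scale over all spatial positions.
        return visual_feats * scale.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    mod = TextConditionedModulation(text_dim=512, visual_channels=2048)
    text = torch.randn(2, 20, 512)      # encoder outputs for 20 source tokens
    feats = torch.randn(2, 2048, 7, 7)  # e.g. a ResNet convolutional feature map
    out = mod(text, feats)
    print(out.shape)  # torch.Size([2, 2048, 7, 7])
```

In this sketch the modulation adds only a single linear layer of parameters, which reflects the paper's stated preference for a simple multiplicative transformation given the limited amount of multimodal translation data.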




Updated: 2020-06-23