Combining Multiple Cues for Visual Madlibs Question Answering
International Journal of Computer Vision (IF 19.5), Pub Date: 2018-05-03, DOI: 10.1007/s11263-018-1096-0
Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
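To make the scoring pipeline concrete, below is a minimal, hypothetical Python sketch of the ideas in the abstract: it substitutes scikit-learn's plain CCA for the paper's normalized CCA (nCCA), uses random toy features in place of the specialized scene/activity/attribute network features and answer embeddings, and replaces the learned score combination with fixed weights. All function names, dimensions, and weight values are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): per-cue joint embeddings
# via CCA, cosine-similarity scoring of candidate answers, and a weighted
# combination of per-cue scores. The paper uses normalized CCA and learns
# the combination weights by solving an optimization problem.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity


def fit_cue_model(image_feats, answer_feats, dim=16):
    """Fit one CCA model for a single image cue (e.g. scene or activity features)."""
    cca = CCA(n_components=dim, max_iter=1000)
    # Rows are (image feature, correct-answer embedding) training pairs.
    cca.fit(image_feats, answer_feats)
    return cca


def score_candidates(cca, image_feat, candidate_feats):
    """Score each candidate answer by cosine similarity in the joint space."""
    img_proj, cand_proj = cca.transform(image_feat.reshape(1, -1), candidate_feats)
    return cosine_similarity(img_proj, cand_proj).ravel()


def combine_cues(per_cue_scores, weights):
    """Weighted sum of per-cue scores; fixed weights stand in for the learned combination."""
    return sum(w * s for w, s in zip(weights, per_cue_scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy training data for two hypothetical cues (e.g. scene and activity features).
    answers_train = rng.normal(size=(400, 64))                # answer embeddings
    cue_feats_train = [rng.normal(size=(400, 128)) for _ in range(2)]
    models = [fit_cue_model(x, answers_train) for x in cue_feats_train]

    # One test question: an image feature per cue plus four candidate answers.
    test_feats = [rng.normal(size=128) for _ in range(2)]
    candidates = rng.normal(size=(4, 64))
    scores = [score_candidates(m, f, candidates) for m, f in zip(models, test_feats)]
    final = combine_cues(scores, weights=[0.6, 0.4])
    print("predicted answer index:", int(np.argmax(final)))
```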
