Densifying Supervision for Fine-Grained Visual Comparisons
International Journal of Computer Vision (IF 19.5). Pub Date: 2020-08-09. DOI: 10.1007/s11263-020-01344-9
Aron Yu , Kristen Grauman

Detecting subtle differences in visual attributes requires inferring which of two images exhibits a property more, e.g., which face is smiling slightly more, or which shoe is slightly more sporty. While valuable for applications ranging from biometrics to online shopping, fine-grained attributes are challenging to learn. Unlike traditional recognition tasks, the supervision is inherently comparative. Thus, the space of all possible training comparisons is vast, and learning algorithms face a sparsity of supervision problem: it is difficult to curate adequate subtly different image pairs for each attribute of interest. We propose to overcome this problem by densifying the space of training images with attribute-conditioned image generation. The main idea is to create synthetic but realistic training images exhibiting slight modifications of attribute(s), obtain their comparative labels from human annotators, and use the labeled image pairs to augment real image pairs when training ranking functions for the attributes. We introduce two variants of our idea. The first passively synthesizes training images by “jittering” individual attributes in real training images. Building on this idea, our second model actively synthesizes training image pairs that would confuse the current attribute model, training both the attribute ranking functions and a generation controller simultaneously in an adversarial manner. For both models, we employ a conditional Variational Autoencoder (CVAE) to perform image synthesis. We demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models. Our approach yields state-of-the-art performance for challenging datasets from two distinct domains.
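The core mechanism the abstract describes — training an attribute ranking function on ordered image pairs, then densifying the sparse real supervision with many subtly different synthetic pairs — can be illustrated with a toy sketch. This is not the paper's CVAE-based pipeline: here images are plain feature vectors, a hidden linear direction plays the role of the ground-truth attribute, and "synthetic" pairs are simply extra samples with small attribute gaps; all names and the hinge-loss linear ranker are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "image" is a feature vector, and a hidden direction w_true
# defines the ground-truth attribute strength (e.g., "sportiness").
DIM = 8
w_true = rng.normal(size=DIM)

def make_pairs(n, min_gap):
    """Sample ordered pairs (x_more, x_less) whose attribute gap exceeds min_gap."""
    pairs = []
    while len(pairs) < n:
        a, b = rng.normal(size=DIM), rng.normal(size=DIM)
        gap = w_true @ a - w_true @ b
        if gap > min_gap:
            pairs.append((a, b))
        elif gap < -min_gap:
            pairs.append((b, a))
    return pairs

# Sparse real supervision: few pairs, mostly coarse (large attribute gaps).
real_pairs = make_pairs(20, min_gap=1.0)

# "Densified" synthetic supervision: many pairs with subtle gaps, standing in
# for generator-produced images whose comparative labels come from annotators.
synthetic_pairs = make_pairs(200, min_gap=0.05)

def train_ranker(pairs, epochs=200, lr=0.05):
    """Fit a linear ranking function f(x) = w @ x with a hinge (margin) loss."""
    w = np.zeros(DIM)
    for _ in range(epochs):
        for x_more, x_less in pairs:
            # Margin constraint: want f(x_more) - f(x_less) >= 1.
            if w @ (x_more - x_less) < 1.0:
                w += lr * (x_more - x_less)  # subgradient step on the hinge
    return w

w_sparse = train_ranker(real_pairs)
w_dense = train_ranker(real_pairs + synthetic_pairs)

def accuracy(w, pairs):
    """Fraction of pairs whose predicted order matches the ground truth."""
    return float(np.mean([w @ a > w @ b for a, b in pairs]))

# Evaluate on held-out *subtle* comparisons, where supervision sparsity hurts most.
test_pairs = make_pairs(500, min_gap=0.05)
print(f"real pairs only : {accuracy(w_sparse, test_pairs):.2f}")
print(f"with synthetic  : {accuracy(w_dense, test_pairs):.2f}")
```

The sketch captures the supervision-sparsity argument in miniature: the ranker trained only on coarse real pairs sees few examples near the decision margin, while the augmented ranker is pushed to order subtly different pairs correctly. The paper's actual models replace the random sampling with attribute-conditioned image generation (passive jittering, or an adversarial controller that targets confusing pairs).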
