Densifying Supervision for Fine-Grained Visual Comparisons


Abstract

Detecting subtle differences in visual attributes requires inferring which of two images exhibits a property more, e.g., which face is smiling slightly more, or which shoe is slightly more sporty. While valuable for applications ranging from biometrics to online shopping, fine-grained attributes are challenging to learn. Unlike traditional recognition tasks, the supervision is inherently comparative. Thus, the space of all possible training comparisons is vast, and learning algorithms face a sparsity of supervision problem: it is difficult to curate adequate subtly different image pairs for each attribute of interest. We propose to overcome this problem by densifying the space of training images with attribute-conditioned image generation. The main idea is to create synthetic but realistic training images exhibiting slight modifications of attribute(s), obtain their comparative labels from human annotators, and use the labeled image pairs to augment real image pairs when training ranking functions for the attributes. We introduce two variants of our idea. The first passively synthesizes training images by “jittering” individual attributes in real training images. Building on this idea, our second model actively synthesizes training image pairs that would confuse the current attribute model, training both the attribute ranking functions and a generation controller simultaneously in an adversarial manner. For both models, we employ a conditional Variational Autoencoder (CVAE) to perform image synthesis. We demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models. Our approach yields state-of-the-art performance for challenging datasets from two distinct domains.
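
As a rough illustration of the training pipeline described above (not the authors' released implementation), the sketch below augments real human-labeled comparison pairs with synthetically generated, human-labeled pairs and fits an attribute ranking function with a pairwise margin ranking loss. The module and variable names (AttributeRanker, real_pairs, synth_pairs) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttributeRanker(nn.Module):
    """Maps an image feature vector to a real-valued attribute strength."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.score(x).squeeze(-1)

def train_ranker(ranker, real_pairs, synth_pairs, epochs=10, lr=1e-3):
    # Each pair is (feat_a, feat_b, label): 1-D feature tensors plus label = +1.0
    # if image A exhibits the attribute more than image B, else -1.0.
    loss_fn = nn.MarginRankingLoss(margin=1.0)
    opt = torch.optim.Adam(ranker.parameters(), lr=lr)
    data = real_pairs + synth_pairs  # densified supervision: real pairs + synthetic pairs
    for _ in range(epochs):
        for feat_a, feat_b, label in data:
            target = torch.as_tensor(label, dtype=torch.float32)
            opt.zero_grad()
            loss = loss_fn(ranker(feat_a), ranker(feat_b), target)
            loss.backward()
            opt.step()
    return ranker
```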

Notes

  1. https://en.wikipedia.org/wiki/Streetlight_effect.

  2. http://vision.cs.utexas.edu/projects/finegrained/utzap50k/.

  3. We use the original model rather than the variant disCVAE (Yan et al. 2016a) since the latter requires additional supervision in the form of foreground object masks.

  4. In our experiments, we use only ordered pairs for both training and testing.

  5. Note that here the word “identity” means an instance for some domain, not necessarily a human identity.

  6. Note that this prior is nonetheless assumed to be coarse, since a subset of dimensions in \(\varvec{y}\) consists of the very attributes we wish to learn better via densifying supervision. For the sake of the prior, the training image attribute strengths originate from the raw decision outputs of a preliminary binary attribute classifier trained on disjoint data labeled for the presence/absence of the attribute. Note that this is a practical simplification: ideally the images that train Attribute2Image would be manually labeled for their attribute strengths. It is more cost-effective to use the real-valued classifier outputs for models trained on binary-labeled images, as done here and in Yan et al. (2016b).

  7. We use “batch” here in the active learning sense: a batch of additional examples are manually labeled then used to update the predictive model. This is not to be confused with (mini)-batches for training the neural networks.

  8. Note that the mean results in Tables 1 and 2 for ATTIC-Auto are equivalent to the rightmost data point (@100%) in Fig. 14.

  9. We do not show Jitter here because the \(\varvec{y}\) values for the Jitter baseline are simply the same as their Real counterparts.

  10. We used only the words from Design 1 as the two designs produced very similar word suggestions.

References

  • Alabdulmohsin, I., Gao, X., & Zhang, X. (2015). Efficient active learning of halfspaces via query synthesis. In AAAI.

  • Altwaijry, H., & Belongie, S. (2012). Relative ranking of facial attractiveness. In Winter conference on applications of computer vision (WACV).

  • Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4), 319–342.

  • Baum, E., & Lang, K. (1992). Query learning can work poorly when a human oracle is used. In IJCNN.

  • Biswas, A., & Parikh, D. (2013). Simultaneous active learning of classifiers and attributes via relative feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., & Belongie, S. (2010). Visual recognition with humans in the loop. In Proceedings of European conference on computer vision (ECCV).

  • Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In Proceedings of international conference on machine learning (ICML).

  • Cai, J., Zha, Z., Wang, M., Zhang, S., & Tian, Q. (2015). An attribute-assisted reranking model for web image search. IEEE Transactions on Image Processing.

  • Cao, C., Kwak, I., Belongie, S., Kriegman, D., & Ai, H. (2014). Adaptive ranking of facial attractiveness. In International conference on multimedia and expo (ICME).

  • Changpinyo, S., Chao, W.-L., & Sha, F. (2017). Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Chaudhuri, S., Kalogerakis, E., Giguere, S., & Funkhouser, T. (2013). AttribIt: Content creation with semantic attributes. In ACM symposium on user interface software and technology (UIST).

  • Chen, K., Gong, S., Xiang, T., & Loy, C. (2013). Cumulative attribute space for age and crowd density estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Chen, L., Zhang, Q., & Li, B. (2014). Predicting multiple attributes via relative multi-task learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Datta, A., Feris, R., & Vaquero, D. (2011). Hierarchical ranking of facial attributes. In Face and gesture.

  • Demirel, B., Cinbis, R. G., & Ikizler-Cinbis, N. (2017). Attributes2Classname: a discriminative model for attribute-based unsupervised zero-shot learning. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Dixit, M., Kwitt, R., Niethammer, M., & Vasconcelos, N. (2017). AGA: attribute-guided augmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Dosovitskiy, A., Springenberg, J., & Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Dosovitskiy, A., Springenberg, J., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems (NIPS).

  • Fan, Q., Gabbur, P., & Pankanti, S. (2013). Relative attributes for large-scale abandoned object detection. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Farrell, R., Oza, O., Zhang, N., Morariu, V., Darrell, T., & Davis, L. (2011). Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In International conference on computer vision (ICCV).

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(9).

  • Freedman, D. (2010). Why scientific studies are so often wrong: The streetlight effect. Discover.

  • Freytag, A., Rodner, E., & Denzler, J. (2014). Selecting influential examples: active learning with expected model output changes. In ECCV.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (NIPS).

  • Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. In Proceedings of international conference on machine learning (ICML).

  • Hariharan, B., & Girshick, R. (2017). Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Hauberg, S., Freifeld, O., Larsen, A., Fisher, J., & Hansen, L. (2017). Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In International conference on artificial intelligence and statistics.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst.

  • Huang, X., Liu, M.-Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In ECCV.

  • Huijser, M. W., & van Gemert, J. C. (2017). Active decision boundary annotation with deep generative models. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In CVPR.

  • Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Synthetic data and artificial neural networks for natural scene text recognition. In NIPS 14 deep learning workshop.

  • Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In Advances in neural information processing systems (NIPS).

  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In Knowledge discovery in databases (PKDD).

  • Kalayeh, M. M., Gong, B., & Shah, M. (2017). Improving facial attribute prediction using semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Kemelmacher-Shlizerman, I., Suwajanakorn, S., & Seitz, S. (2014). Illumination-aware age progression. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Khoreva, A., Benenson, R., Ilg, E., Brox, T., & Schiele, B. (2017). Lucid data dreaming for object tracking. Technical Report arXiv:1703.09554.

  • Khosla, A., Bainbridge, W. A., Torralba, A., & Oliva, A. (2013). Modifying the memorability of face photographs. In International conference on computer vision (ICCV).

  • Kingma, D., & Welling, M. (2014). Auto-encoding variational bayes. In Proceedings international conference on learning representations (ICLR).

  • Kovashka, A., & Grauman, K. (2013). Attribute pivots for guiding relevance feedback in image search. In International conference on computer vision (ICCV).

  • Kovashka, A., Parikh, D., & Grauman, K. (2012). WhittleSearch: Image search with relative attribute feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Kovashka, A., Parikh, D., & Grauman, K. (2015). WhittleSearch: Interactive image search with relative attribute feedback. International Journal of Computer Vision (IJCV), 115(2), 185–210.


  • Kulkarni, T., Whitney, W., Kohli, P., & Tenenbaum, J. (2015). Deep convolutional inverse graphics network. In Advances in neural information processing systems (NIPS).

  • Kumar, N., Belhumeur, P., & Nayar, S. (2008). FaceTracer: A search engine for large collections of images with faces. In European conference on computer vision (ECCV).

  • Kwitt, R., Hegenbart, S., & Niethammer, M. (2016). One-shot learning of scene locations via feature trajectory transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Laffont, P., Ren, Z., Tao, X., Qian, C., & Hays, J. (2014). Transient attributes for high-level understanding and editing of outdoor scenes. In SIGGRAPH.

  • Lampert, C., Nickisch, H., Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In Conference on computer vision and pattern recognition (CVPR).

  • Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., et al. (2017). Fader networks: Manipulating images by sliding attributes. In NIPS (pp. 5963–5972).

  • Li, S., Shan, S., & Chen, X. (2012). Relative forest for attribute prediction. In Asian conference on computer vision (ACCV).

  • Li, M., Zuo, W., & Zhang, D. (2016). Convolutional network for attribute-driven and identity-preserving human face generation. Technical Report arXiv:1608.06434.

  • Liang, L., & Grauman, K. (2014). Beyond comparing image pairs: Setwise active learning for relative attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Lu, Y., Tai, Y.-W., & Tang, C.-K. (2018). Attribute-guided face generation using conditional cyclegan. In Proceedings of European conference on computer vision (ECCV).

  • Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.


  • Maji, S. (2012). Discovering a lexicon of parts and attributes. In Second international workshop on parts and attributes, ECCV.

  • Maji, S., Kannala, J., Rahtu, E., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. Technical Report arXiv:1306.5151.

  • Matthews, T., Nixon, M., & Niranjan, M. (2013). Enriching texture analysis with semantic data. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Meng, Z., Adluru, N., Kim, H. J., Fung, G., & Singh, V. (2018). Efficient relative attribute learning using graph neural networks. In Proceedings of European conference on computer vision (ECCV).

  • Miller, E., Matsakis, N., & Viola, P. (2000). Learning from one example through shared densities on transforms. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Moosavi-Dezfooli, S., Fawzi, A., & Frossard, P. (2016). DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • O’Donovan, P., Libeks, J., Agarwala, A., & Hertzmann, A. (2014). Exploratory font selection using crowdsourced attributes. In SIGGRAPH.

  • Pandey, G., & Dukkipati, A. (2016). Variational methods for conditional multimodal learning: Generating human faces from attributes. Technical Report arXiv:1603.01801.

  • Parikh, D., & Grauman, K. (2011). Relative attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Park, D., & Ramanan, D. (2015). Articulated pose estimation with tiny synthetic videos. In ChaLearn workshop, CVPR.

  • Paulin, M., Revaud, J., Harchaoui, Z., Perronnin, F., & Schmid, C. (2014). Transformation pursuit for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object detectors from 3D models. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Peng, X., Tang, Z., Yang, F., Feris, R. S., & Metaxas, D. (2018). Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In CVPR.

  • Pishchulin, L., Jain, A., Wojek, C., Thormahlen, T., & Schiele, B. (2011). In good shape: Robust people detection based on appearance and shape. In British machine vision conference (BMVC).

  • Qian, B., Wang, X., Wang, F., Li, H., Ye, J., & Davidson, I. (2013). Active learning from relative queries. In IJCAI international joint conference on artificial intelligence.

  • Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.

  • Reid, D., & Nixon, M. (2013). Human identification using facial comparative descriptions. In ICB.

  • Reid, D., & Nixon, M. (2014). Using comparative human descriptions for soft biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 36.

  • Sadovnik, A., Gallagher, A., Parikh, D., & Chen, T. (2013). Spoken attributes: Mixing binary and relative attributes to say the right thing. In International conference on computer vision (ICCV).

  • Sandeep, R., Verma, Y., & Jawahar, C. (2014). Relative parts: Distinctive parts for learning relative attributes. In Conference on computer vision and pattern recognition (CVPR).

  • Settles, B. (2010). Active learning literature survey. Technical report.

  • Shakhnarovich, G., Viola, P., & Darrell, T. (2003). Fast pose estimation with parameter-sensitive hashing. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In CVPR.

  • Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., & Webb, R. (2017). Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Shrivastava, A., Singh, S., & Gupta, A. (2012). Constrained semi-supervised learning using attributes and comparative attributes. In Proceedings of European conference on computer vision (ECCV).

  • Siddiquie, B., Feris, R., & Davis, L. (2011). Image ranking and retrieval based on multi-attribute queries. In CVPR.

  • Simard, P., Steinkraus, D., & Platt, J. (2003). Best practices for convolutional neural networks applied to visual document analysis. In ICDAR.

  • Singh, K., & Lee, Y. J. (2016). End-to-end localization and ranking for relative attributes. In Proceedings of European conference on computer vision (ECCV).

  • Souri, Y., Noury, E., & Adeli, E. (2016). Deep relative attributes. In Asian conference on computer vision (ACCV).

  • Su, J.-C., Wu, C., Jiang, H., & Maji, S. (2017). Reasoning about fine-grained attribute phrases using reference games. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Tong, S., & Koller, D. (2000). Support vector machine active learning with applications to text classification. In ICML.

  • Upchurch, P., Gardner, J., Pleiss, G., Pless, R., Snavely, N., Bala, K., & Weinberger, K. (2017). Deep feature interpolation for image content changes. In CVPR.

  • Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. In CVPR.

  • Verma, V., Arora, G., Mishra, A., & Rai, P. (2018). Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Vijayanarasimhan, S., & Grauman, K. (2009). What’s it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Vijayanarasimhan, S., & Grauman, K. (2014). Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision (IJCV), 108(1), 97–114.


  • Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of international conference on machine learning (ICML).

  • Xian, Y., Lorenz, T., Schiele, B., & Akata, Z. (2018). Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Xiao, F., & Lee, Y. J. (2015). Discovering the spatial extent of relative attributes. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2Image: Conditional image generation from visual attributes. In Proceedings of European conference on computer vision (ECCV).

  • Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2Image: Conditional image generation from visual attributes. Technical Report arXiv:1512.00570.

  • Yang, D., & Deng, J. (2017). Shape from shading through shape evolution. Technical Report arXiv:1712.02961.

  • Yang, L., Luo, P., Loy, C. C., & Tang, X. (2015). A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Yang, X., Zhang, T., Xu, C., Yan, S., Hossain, M., & Ghoneim, A. (2016). Deep relative attributes. IEEE Transactions on Multimedia, 18(9).

  • Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning with attributes. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Yu, A., & Grauman, K. (2014). Fine-grained visual comparisons with local learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Yu, A., & Grauman, K. (2017). Semantic Jitter: Dense supervision for visual comparisons via synthetic images. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Yu, A., & Grauman, K. (2019). Thinking outside the pool: Active training image creation for relative attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Yumer, M. E., Chaudhuri, S., Hodgins, J. K., & Kara, L. B. (2015). Semantic shape editing using deformation handles. ACM Transactions on Graphics, 34(4), 86:1–86:12.


  • Zhang, G., Kan, M., Shan, S., & Chen, X. (2018). Generative adversarial network with spatial attention for face attribute editing. In ECCV.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X. & Metaxas, D. (2017) Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.

  • Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J., Jin, H., & Funkhouser, T. (2017). Physically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR.

  • Zhao, L., Sukthankar, G., & Sukthankar, R. (2011). Robust active learning using crowdsourced annotations for activity recognition. In HCOMP.

  • Zhu, J.-J., & Bento, J. (2017). Generative adversarial active learning. Technical Report arXiv:1702.07956.

  • Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., & Elgammal, A. (2018). A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).


Acknowledgements

We thank Bo Xiong, Qiang Liu, and Xinchen Yan for helpful discussions. We also thank the anonymous reviewers for their valuable comments. The University of Texas at Austin is supported in part by ONR PECASE N00014-15-1-2291 and NSF IIS-1514118.

Author information

Corresponding author

Correspondence to Aron Yu.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.


A Appendix

A.1 Fine-Grained Lexicon Experiment

To obtain our fine-grained lexicon, we design our experiments in the form of “complete the sentence” questions and pose them to Amazon MTurk workers. We experiment with two designs: Design 1 compares two individual images, while Design 2 compares one image against a group of six images. Given the meta-data, which contains category (e.g., slippers, boots) and subcategory (e.g., flats, ankle high) labels for each image, we combine these labels into a set of 21 unique category-subcategory pseudo-classes (e.g., slippers-flats, shoes-loafers). Using these pseudo-classes, we sample 4000 supervision pairs (for each design), where 80% compare images within the same pseudo-class and 20% compare images within the same category. By concentrating sampled pairs on items within a pseudo-class, we aim for a majority of the pairs to contain visually similar items, forcing the human subjects to zero in on fine-grained differences.
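
A minimal sketch of this sampling scheme, under an assumed item schema (a list of dicts with "id", "category", and "subcategory" keys); this is illustrative, not the authors' code:

```python
import random
from collections import defaultdict

def sample_pairs(items, n_pairs=4000, within_pseudo_frac=0.8, seed=0):
    # items: list of dicts with "id", "category", "subcategory" keys (assumed schema)
    rng = random.Random(seed)
    by_pseudo, by_cat = defaultdict(list), defaultdict(list)
    for it in items:
        by_pseudo[(it["category"], it["subcategory"])].append(it)
        by_cat[it["category"]].append(it)

    def draw(groups):
        # Pick a group with at least two members, then two distinct items from it.
        pool = [g for g in groups.values() if len(g) >= 2]
        a, b = rng.sample(rng.choice(pool), 2)
        return (a["id"], b["id"])

    pairs = []
    for i in range(n_pairs):
        # First 80% of pairs come from the same pseudo-class, the rest from the same category.
        groups = by_pseudo if i < int(within_pseudo_frac * n_pairs) else by_cat
        pairs.append(draw(groups))
    return pairs
```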

For each question, the workers are asked to complete the sentence, “Shoe A is a little more/less \(\langle \)insert word\(\rangle \) than Shoe B” using a single word (“Shoe B” is replaced by “Group B” for Design 2). They are instructed to identify subtle differences between the images and provide a short rationale to elaborate on their choices. Figure 20 shows a screenshot of a sample question.

We post-process the fine-grained word suggestions by correcting for human variations (e.g., misspellings, word forms), merging visual synonyms/antonyms, and evaluating the rationales. For example, “casual” and “formal” are visual antonyms, and workers used similar keywords in their rationales for “durable” and “rugged”; in both cases, the frequency counts for the two words are combined. Over 1000 MTurk workers participated in our study, yielding a total of 350+ distinct word suggestions (Note 10). In the end, we select the 10 most frequently appearing words as our fine-grained relative attribute lexicon for shoes.
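
A small sketch of this post-processing step, with an illustrative merge map rather than the study's actual synonym/antonym groups:

```python
from collections import Counter

MERGE_MAP = {  # assumed examples; the real merges were derived from worker rationales
    "formal": "casual",   # visual antonyms share one attribute axis
    "rugged": "durable",  # near-synonyms judged from rationales
}

def build_lexicon(suggestions, top_k=10):
    # suggestions: raw single-word answers from MTurk workers
    normalized = (MERGE_MAP.get(w.strip().lower(), w.strip().lower()) for w in suggestions)
    counts = Counter(normalized)
    return [word for word, _ in counts.most_common(top_k)]
```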


For future work, we plan to expand the densification idea in several directions: (1) Our current models learn one ranker per attribute, which can be costly as we scale up the total number of attributes. We would like to explore joint multi-attribute models for the ranker and consider how a human-in-the-loop system might allow simultaneous refinement of the image generation scheme. (2) The performance of our data augmentation approaches is highly dependent on the quality of the pre-trained attribute-conditioned image generator. We would like to explore the use of alternative image generators that can handle higher resolution images and potentially generate more realistic looking images. (3) While our focus is the ranking task for fine-grained visual comparisons, it would be interesting future work to test our ideas applied to classification models.

A.2 Collecting Pairwise Labels

In Fig. 21, we show the interface we used to collect labels from human annotators on MTurk for the actively synthesized training images. In addition to the relative decision, we instruct the workers to indicate their level of confidence in their decision. Image pairs with low overall confidence and/or low agreement among workers are pruned and not used in training.
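
A hedged sketch of this pruning rule under an assumed response format (per-pair lists of (vote, confidence) tuples); the thresholds are illustrative, not values from the paper:

```python
from collections import Counter

def prune_pairs(labeled_pairs, min_agreement=0.8, min_confidence=0.6):
    # labeled_pairs: {pair_id: [(vote, confidence), ...]} where vote is "A" or "B"
    # and confidence is in [0, 1]; thresholds are illustrative assumptions.
    kept = {}
    for pair_id, responses in labeled_pairs.items():
        votes = Counter(v for v, _ in responses)
        majority_vote, majority_count = votes.most_common(1)[0]
        agreement = majority_count / len(responses)
        avg_conf = sum(c for _, c in responses) / len(responses)
        if agreement >= min_agreement and avg_conf >= min_confidence:
            kept[pair_id] = majority_vote  # keep the pair with its consensus ordering
    return kept
```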


Cite this article

Yu, A., Grauman, K. Densifying Supervision for Fine-Grained Visual Comparisons. Int J Comput Vis 128, 2704–2730 (2020). https://doi.org/10.1007/s11263-020-01344-9
