Pattern Recognition Letters, Volume 160, August 2022, Pages 142-147

Baby steps towards few-shot learning with multiple semantics

https://doi.org/10.1016/j.patrec.2022.06.012

Highlights

  • We propose a new setting, closer to ‘infant learning’: Few-Shot Learning with Multiple and Complex Semantics (FSL-MCS).

  • For this setting we propose a new benchmark and an associated training and evaluation protocol.

  • We propose a new multi-branch architecture that provides the first batch of encouraging results on the proposed FSL-MCS benchmark.

Abstract

Learning from one or a few visual examples is one of the key capabilities of humans from early infancy, but it is still a significant challenge for modern AI systems. While considerable progress has been achieved in few-shot learning from a few image examples, much less attention has been given to the verbal descriptions that are usually provided to infants when they are presented with a new object. In this paper, we focus on the role of additional semantics that can significantly facilitate few-shot visual learning. Building upon recent advances in few-shot learning with additional semantic information, we demonstrate that further improvements are possible by combining multiple and richer semantics (category labels, attributes, and natural-language descriptions). Using these ideas, we offer the community new results on the popular miniImageNet and CUB few-shot benchmarks, comparing favorably to the previous state of the art for both visual-only and visual-plus-semantics approaches. We also perform an ablation study investigating the components and design choices of our approach. Code is available at github.com/EliSchwartz/mutiple-semantics.

Introduction

Modern-day computer vision has experienced a tremendous leap due to the advent of deep learning (DL) techniques. DL-based approaches now reach performance levels higher even than humans' in tasks requiring expertise, such as recognizing dog breeds or the faces of thousands of celebrities. Yet, despite all these advances, some innate human abilities, available to us at a very young age, still elude modern AI systems. One of these is the ability to learn, and later successfully recognize, new, previously unseen visual categories from only one or very few examples. This ‘few-shot learning’ task has been thoroughly explored in the computer vision literature and numerous approaches have been proposed; see [1] for a review. Yet so far, the performance of even the best few-shot learning methods falls short, by a significant margin, of that of fully supervised methods trained with a large number of examples, e.g., on ImageNet [2]. The core difficulty is adapting a model to novel classes from a few samples without over-fitting.

One key ingredient of human infant learning, which has only very recently found its way into visual few-shot learning approaches, is the semantics that accompanies a provided example. It has been shown in the child-development literature that infants' object-recognition ability is linked to their language skills, and it is hypothesized that this may be related to the ability to describe objects [3]. Indeed, when a parent points a finger at a new category to be learned (‘look, here is a puppy’, Fig. 1), this is commonly accompanied by additional semantic references or descriptions for that category (e.g., ‘look at his nice fluffy ears’, ‘look at his nice silky fur’, ‘the puppy goes woof-woof’). This additional, often rich, semantic information can be very useful to the learner, and has been exploited in the context of zero-shot learning and visual-semantic embeddings. The language and vision domains both describe the same physical world in different ways, and in many cases they contain useful complementary information that can be carried over to a learner in the other domain (visual to language and vice versa).

Only a handful of recent works have used semantics to facilitate few-shot learning. Chen et al. [4] used an embedding vector of either the category label or a given set of category attributes to regularize the latent representation of an auto-encoder, adding a loss that pushes each sample's latent vector as close as possible to the corresponding semantic vector. In [5], the semantic representation of visual categories is learned on top of GloVe [6] word embeddings, jointly with a Proto-Net-based [7] few-shot classifier and with a convex combination of the two. The result of this joint training is a powerful few-shot and zero-shot (i.e., semantic-based) ensemble that surpassed all other few-shot learning methods to date on the challenging miniImageNet benchmark [8]. In both cases, combining few-shot learning with some category semantics (labels or attributes) proved highly beneficial to the performance of the few-shot learner. Yet in both cases, only a single word embedding or a set of several prescribed numerical attributes was used to encode the semantics.
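To make the convex-combination idea concrete, the sketch below blends each class's visual prototype (the few-shot branch) with a semantic class embedding (the zero-shot branch) and classifies queries against the blended prototypes. This is a minimal illustration, not the exact model of [5]: the mixing coefficient `lam` is a fixed scalar here (in [5] it is predicted adaptively), the semantic vectors are assumed to be already projected into the visual feature space, and all names and dimensions are our own hypothetical choices.

```python
import torch

def blend_prototypes(visual_protos, semantic_vecs, lam=0.7):
    """Convex combination of visual prototypes and semantic class
    embeddings, both assumed to live in the same feature space.
    lam = 1 recovers a purely visual Proto-Net classifier,
    lam = 0 a purely semantic (zero-shot) one."""
    return lam * visual_protos + (1.0 - lam) * semantic_vecs

# Toy 5-way task in a 64-d feature space.
visual = torch.randn(5, 64)      # averaged support features per class
semantic = torch.randn(5, 64)    # projected label embeddings (e.g., GloVe)
blended = blend_prototypes(visual, semantic)
queries = torch.randn(10, 64)
pred = torch.cdist(queries, blended).argmin(dim=1)  # nearest blended prototype
print(pred)
```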

In this work, we show that more can be gained by exploring a more realistic, human-like learning setting in which the learner has access to multiple and richer semantics. Depending on what is available for the dataset, these semantics can include: category labels; richer ‘description-level’ semantic information (one or a few natural-language sentences describing the category); or attributes. We demonstrate how this learning-with-semantics setting can facilitate few-shot learning (leveraging the intuition of how human infants learn). The results compare favorably with the previous visual and visual + semantics state of the art on the challenging miniImageNet [8] and CUB [9] few-shot benchmarks.

To summarize, the contributions of this work are three-fold. First, we propose that the community consider a new setting, perhaps closer to ‘infant learning’: Few-Shot Learning with Multiple and Complex Semantics (FSL-MCS). Second, in this context we propose a new benchmark for FSL-MCS and an associated training and evaluation protocol. Third, we propose a new multi-branch network architecture that provides the first batch of encouraging results for the proposed FSL-MCS benchmark.

Section snippets

Few-shot learning

The major approaches to few-shot learning include metric learning, meta-learning (or learning-to-learn), and generative (or augmentation-based) methods.

Few-shot learning by metric learning: Methods of this type [7], [10] learn a non-linear embedding into a metric space, where an L2 nearest-neighbor (or similar) approach is used to classify instances of new categories according to their proximity to the few labeled training examples. Additional proposed variants include [11], which uses a metric
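As a concrete reference point for this family of methods, here is a minimal Proto-Net-style [7] sketch: support embeddings are averaged into per-class prototypes, and each query is labeled by L2 proximity to the nearest prototype. The embedding network itself is omitted (random features stand in for it), and the helper names are ours, not from [7].

```python
import torch

def class_prototypes(support_feats, support_labels, n_way):
    """Average the embedded support examples of each class."""
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_way)
    ])

def nearest_prototype(query_feats, protos):
    """Classify each query by L2 distance to the class prototypes."""
    return torch.cdist(query_feats, protos).argmin(dim=1)

# Toy 5-way 1-shot episode with 64-d embeddings.
support = torch.randn(5, 64)          # one embedded support image per class
labels = torch.arange(5)
protos = class_prototypes(support, labels, n_way=5)
print(nearest_prototype(torch.randn(3, 64), protos))
```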

Method

Our general model architecture is summarized in Fig. 2. The model comprises a visual information branch built on a CNN backbone that computes features both for the training images of the few-shot task and for the query images. As in Proto-Nets [7], the feature vectors of each category's support examples are averaged to form a visual prototype feature vector V for that category. The visual prototype serves as the first estimate of the prototype, P0 = V. Then, the prototype is
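The snippet above cuts off at the prototype-refinement step, but the multi-branch idea it describes can be sketched as follows: starting from the visual prototype P0 = V, each semantic branch projects its cue (a label embedding, attributes, or an encoded description) into the feature space and folds it into the current prototype via a learned convex combination. This is a hypothetical illustration of the described architecture, with invented module names and layer sizes, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SemanticBranch(nn.Module):
    """Projects one semantic cue into the visual feature space and
    predicts a mixing coefficient for updating the prototype."""
    def __init__(self, sem_dim, feat_dim):
        super().__init__()
        self.project = nn.Linear(sem_dim, feat_dim)
        self.gate = nn.Linear(sem_dim + feat_dim, 1)

    def forward(self, prototype, semantic):
        sem_feat = self.project(semantic)
        lam = torch.sigmoid(self.gate(torch.cat([semantic, prototype], dim=-1)))
        # Convex combination of the current prototype and the semantic estimate.
        return lam * prototype + (1 - lam) * sem_feat

def refine_prototype(visual_proto, semantics, branches):
    """Start from P0 = V and fold in each semantic cue in turn."""
    p = visual_proto
    for branch, sem in zip(branches, semantics):
        p = branch(p, sem)
    return p

# Toy: 5-way task, 64-d visual features, two semantic cues per class.
branches = nn.ModuleList([SemanticBranch(300, 64),   # e.g., GloVe label vector
                          SemanticBranch(512, 64)])  # e.g., encoded description
V = torch.randn(5, 64)                               # visual prototypes P0
sems = [torch.randn(5, 300), torch.randn(5, 512)]
print(refine_prototype(V, sems, branches).shape)     # torch.Size([5, 64])
```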

Experimental results

We evaluated our approach on the challenging miniImageNet few-shot benchmark [8], used for evaluation by most (if not all) few-shot learning works. We also evaluated on the CUB dataset [41], which includes another form of semantics, the attribute vector.

Summary & conclusions

In this work, we have proposed an extended approach to few-shot learning with additional semantic information. We suggest bringing few-shot learning with semantics closer to the setting experienced by human infants: we build on multiple semantic explanations (name, attributes, and description) that accompany the few image examples, and we utilize more complex natural-language semantics rather than just the name of the category. In our experiments, we have only touched the tip of the iceberg of the possible

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (44)

  • W.-Y. Chen, A closer look at few-shot classification, ICLR (2018).
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, IJCV (2015).
  • L.B. Smith, Learning to recognize objects, Psychol. Sci. (2003).
  • Z. Chen, Y. Fu, Y. Zhang, Y.-G. Jiang, X. Xue, L. Sigal, Semantic feature augmentation in few-shot learning, ...
  • C. Xing et al., Adaptive cross-modal few-shot learning, NeurIPS (2019).
  • J. Pennington et al., GloVe: global vectors for word representation, EMNLP (2014).
  • J. Snell et al., Prototypical networks for few-shot learning, NIPS (2017).
  • O. Vinyals et al., Matching networks for one shot learning, NIPS (2016).
  • P. Welinder et al., Caltech-UCSD Birds 200, Technical Report CNS-TR-2010-001 (2010).
  • O. Rippel, M. Paluri, P. Dollar, L. Bourdev, Metric learning with adaptive density discrimination, arXiv preprint ...
  • V. Garcia, J. Bruna, Few-shot learning with graph neural networks, (2017) 1–13. arXiv preprint ...
  • A. Santoro et al., Meta-learning with memory-augmented neural networks, ICML (2016).
  • F. Sung et al., Learning to compare: relation network for few-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018).
  • F. Hao et al., Collect and select: semantic alignment metric learning for few-shot learning, Proceedings of the IEEE/CVF International Conference on Computer Vision (2019).
  • X. Jiang et al., Learning to learn with conditional class dependencies, ICLR (2018).
  • Z. Ji et al., Information symmetry matters: a modal-alternating propagation network for few-shot learning, IEEE Trans. Image Process. (2022).
  • C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, arXiv preprint ...
  • Z. Li, F. Zhou, F. Chen, H. Li, Meta-SGD: learning to learn quickly for few-shot learning, arXiv preprint ...
  • F. Zhou, B. Wu, Z. Li, Deep meta-learning: learning to learn in the concept space, (2018). arXiv preprint ...
  • S. Ravi et al., Optimization as a model for few-shot learning, ICLR (2017).
  • S. Doveh, E. Schwartz, C. Xue, R. Feris, A. Bronstein, R. Giryes, L. Karlinsky, MetAdapt: meta-learned task-adaptive ...
  • A.A. Rusu et al., Meta-learning with latent embedding optimization, ICLR (2018).