Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8). Pub Date: 2022-11-07. DOI: 10.1109/tpami.2021.3120550
Yongqin Xian, Bruno Korbar, Matthijs Douze, Lorenzo Torresani, Bernt Schiele, Zeynep Akata

Few-shot learning aims to recognize novel classes from only a few examples. Although significant progress has been made in the image domain, few-shot video classification remains relatively unexplored. We argue that previous methods underestimate the importance of video feature learning and propose to learn spatiotemporal features using a 3D CNN. We introduce a two-stage approach that first learns video features on base classes and then fine-tunes classifiers on novel classes, and we show that this simple baseline outperforms prior few-shot video classification methods by over 20 points on existing benchmarks. To circumvent the need for labeled examples, we present two novel approaches that yield further improvement. First, we leverage tag-labeled videos from a large dataset by retrieving videos via their tags and then selecting the best clips based on visual similarity. Second, we learn generative adversarial networks that generate video features of novel classes from their semantic embeddings. Moreover, we find that existing benchmarks are limited because they focus on only 5 novel classes in each testing episode, and we introduce more realistic benchmarks involving more novel classes, i.e., few-shot learning, as well as a mixture of novel and base classes, i.e., generalized few-shot learning. The experimental results show that our retrieval and feature generation approaches significantly outperform the baseline on the new benchmarks.
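As a rough illustration of the two-stage baseline described in the abstract, the sketch below first trains a 3D CNN end-to-end on the base classes and then freezes the backbone and fine-tunes only a linear classifier on the few labeled novel-class clips. This is a minimal sketch assuming a PyTorch/torchvision setup: the r3d_18 backbone, the split sizes, and the loader arguments (loader, support_loader) are illustrative placeholders, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_BASE_CLASSES = 64    # illustrative split sizes, not the paper's
NUM_NOVEL_CLASSES = 24

# Stage 1: learn spatiotemporal features by training the 3D CNN
# end-to-end as a standard classifier over the base classes.
backbone = r3d_18(weights=None)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Linear(feat_dim, NUM_BASE_CLASSES)

def train_base(loader, epochs=10):
    opt = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    backbone.train()
    for _ in range(epochs):
        for clips, labels in loader:        # clips: (B, 3, T, H, W)
            opt.zero_grad()
            loss_fn(backbone(clips), labels).backward()
            opt.step()

# Stage 2: freeze the backbone and fine-tune only a new linear
# classifier on the few labeled clips of the novel classes.
def finetune_novel(support_loader, epochs=50):
    backbone.fc = nn.Identity()             # backbone now emits features
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    clf = nn.Linear(feat_dim, NUM_NOVEL_CLASSES)
    opt = torch.optim.SGD(clf.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in support_loader:
            with torch.no_grad():
                feats = backbone(clips)
            opt.zero_grad()
            loss_fn(clf(feats), labels).backward()
            opt.step()
    return clf

Because only the linear head is updated in stage 2, the handful of novel-class examples cannot overfit the full network, which is what makes this simple baseline competitive.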

Updated: 2021-10-15