VideoLT: Large-scale Long-tailed Video Recognition
arXiv - CS - Computer Vision and Pattern Recognition, Pub Date: 2021-05-06, DOI: arxiv-2105.02668
Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, Larry Davis

Real-world label distributions are often long-tailed and imbalanced, which biases models toward dominant labels. While long-tailed recognition has been extensively studied for image classification, limited effort has been devoted to the video domain. In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition. VideoLT contains 256,218 untrimmed videos annotated into 1,004 classes with a long-tailed distribution. Through extensive studies, we demonstrate that state-of-the-art methods for long-tailed image recognition do not perform well in the video domain because of the additional temporal dimension in video data. This motivates us to propose FrameStack, a simple yet effective method for long-tailed video recognition. In particular, FrameStack performs sampling at the frame level to balance class distributions, and the sampling ratio is determined dynamically using knowledge derived from the network during training. Experimental results demonstrate that FrameStack can improve classification performance without sacrificing overall accuracy.
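
The abstract only says that the frame-level sampling ratio is set dynamically from knowledge the network provides during training. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes the "knowledge" is a running per-class accuracy in [0, 1] and that the number of frames kept shrinks linearly as that accuracy grows. The function names `frames_to_keep` and `sample_frames`, and the linear rule itself, are hypothetical.

```python
import math
import random


def frames_to_keep(class_accuracy, max_frames, min_frames=1):
    """Map a running per-class accuracy to a frame budget.

    Assumed rule (not taken from the paper): classes the network
    already handles well keep fewer frames, weak classes keep more.
    """
    acc = min(max(class_accuracy, 0.0), 1.0)  # clamp to [0, 1]
    budget = math.ceil(max_frames * (1.0 - acc))
    return max(min_frames, min(max_frames, budget))


def sample_frames(frames, class_accuracy, rng=None):
    """Keep a random, temporally ordered subset of one video's frames."""
    if not frames:
        return []
    rng = rng or random.Random()
    k = frames_to_keep(class_accuracy, len(frames))
    keep = sorted(rng.sample(range(len(frames)), k))
    return [frames[i] for i in keep]
```

Under this assumed rule, a 60-frame video from a class with running accuracy 0.75 keeps 15 frames, while a video from a class the network still gets entirely wrong keeps all 60, so poorly learned (typically tail) classes contribute more frames per video.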

Updated: 2021-05-07