Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2017-07-04, DOI: 10.1109/tpami.2017.2723400
Tsung-Yu Lin, Aruni RoyChowdhury, Subhransu Maji

We present a simple and effective architecture for fine-grained recognition called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs are related to orderless texture representations built on deep features but can be trained in an end-to-end manner. Our most accurate model obtains 84.1, 79.4, 84.5, and 91.3 percent per-image accuracy on the Caltech-UCSD birds [1], NABirds [2], FGVC aircraft [3], and Stanford cars [4] datasets, respectively, and runs at 30 frames per second on an NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) they are also effective for other image classification tasks such as texture and scene recognition, and (3) they can be trained from scratch on the ImageNet dataset, offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn.
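The "pooled outer product of features" described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is at the URL above). The function name `bilinear_pool`, the use of sum pooling, the channels-last layout, and the random arrays standing in for the outputs of two CNNs are all assumptions made for illustration, and any normalization or classifier stages are omitted.

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Sum-pool the outer products of two feature maps over all locations.

    feat_a: array of shape (H, W, Ca), hypothetical output of the first CNN.
    feat_b: array of shape (H, W, Cb), hypothetical output of the second CNN.
    Returns a flat descriptor of length Ca * Cb.
    """
    h, w, ca = feat_a.shape
    if feat_b.shape[:2] != (h, w):
        raise ValueError("both feature maps must share the same spatial size")
    a = feat_a.reshape(h * w, ca)   # one row per spatial location
    b = feat_b.reshape(h * w, -1)
    # Summing the per-location outer products a_i b_i^T equals A^T B.
    pooled = a.T @ b
    return pooled.reshape(-1)

# Random stand-ins for CNN feature maps (assumption: 4x4 grid, 3 and 5 channels).
rng = np.random.default_rng(0)
fa = rng.standard_normal((4, 4, 3))
fb = rng.standard_normal((4, 4, 5))
desc = bilinear_pool(fa, fb)

# Apply the same shuffle of spatial locations to both maps: because pooling
# discards location, the descriptor is unchanged up to floating-point error.
perm = rng.permutation(16)
fa_s = fa.reshape(16, 3)[perm].reshape(4, 4, 3)
fb_s = fb.reshape(16, 5)[perm].reshape(4, 4, 5)
desc_s = bilinear_pool(fa_s, fb_s)
```

Comparing `desc` and `desc_s` with `np.allclose` shows the orderless property in this toy setting: the descriptor depends on which feature pairs co-occur at a location, not on where they occur. The descriptor length `Ca * Cb` grows with the product of the channel counts, which gives some intuition for why the abstract's finding on reducing the feature size matters.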

Updated: 2017-07-04