Pattern Recognition (IF 7.5), Pub Date: 2021-03-18, DOI: 10.1016/j.patcog.2021.107947
Yi Niu, Yang Jiao, Guangming Shi
Fine-grained visual categorization (FGVC) has attracted extensive attention in recent years. The general pipeline of current FGVC techniques is to 1) locate the discriminative regions; 2) extract features from each region independently; and 3) feed the integrated features to a classifier. In this paper, we re-investigate this pipeline from the viewpoint of human visual recognition mechanisms. In the human visual system (HVS), perceiving discriminative regions is a temporal process driven by the attention-shift mechanism. Existing strategies, which extract features independently and feed them to the classifier in a single pass, ignore the inherent semantic relationships among discriminative regions and therefore cannot properly model the attention-shift process. We therefore propose a novel end-to-end FGVC network structure, named Attention-Shift based Deep Neural Network (AS-DNN), that locates the discriminative regions automatically and encodes their semantic correlations iteratively. AS-DNN consists of two channels: 1) the global perception channel and 2) the attention-shift channel, which simulate the global perception and attention-shift mechanisms, respectively. Experimental results show that AS-DNN achieves state-of-the-art performance, outperforming both weakly and strongly supervised CNN-based FGVC algorithms on several widely used fine-grained datasets. Visualization of the attention regions shows that the proposed method locates the discriminative regions robustly under complex backgrounds and postures.
Title: Attention-shift based deep neural network for fine-grained visual categorization
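The contrast the abstract draws between independent, one-pass region features and an iterative attention shift can be illustrated with a toy sketch. This is not the AS-DNN architecture, which the abstract does not specify: the saliency grid, the window mean used as a "feature", the decayed running state, and the inhibition-of-return rule for shifting attention are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def argmax_2d(grid: Grid) -> Tuple[int, int]:
    """Return (row, col) of the largest value; the first one wins on ties."""
    best = (0, 0)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value > grid[best[0]][best[1]]:
                best = (r, c)
    return best


def global_perception(saliency: Grid) -> float:
    """Toy stand-in for the global channel: one summary of the whole image."""
    values = [v for row in saliency for v in row]
    return sum(values) / len(values)


def attention_shift(saliency: Grid, steps: int, radius: int = 1,
                    decay: float = 0.5) -> Tuple[List[Tuple[int, int]], float]:
    """Toy stand-in for the attention-shift channel (an assumption, not AS-DNN).

    Visits `steps` regions from most to least salient and folds each region
    feature into a running state, so later regions are encoded in the context
    of earlier ones instead of independently.
    """
    work = [row[:] for row in saliency]  # working copy; the input stays intact
    rows, cols = len(work), len(work[0])
    state = 0.0
    visited: List[Tuple[int, int]] = []
    for _ in range(steps):
        r, c = argmax_2d(work)
        visited.append((r, c))
        r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
        c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
        # Region feature: mean of the original values inside the window.
        window = [saliency[i][j] for i in range(r0, r1) for j in range(c0, c1)]
        feature = sum(window) / len(window)
        # Recurrent update: the new state depends on the previous state.
        state = decay * state + feature
        # Inhibition of return: suppress the attended window so the next
        # step shifts attention to a different region.
        for i in range(r0, r1):
            for j in range(c0, c1):
                work[i][j] = float("-inf")
    return visited, state


def two_channel_score(saliency: Grid, steps: int = 2) -> float:
    """Combine both toy channels by simple addition (an arbitrary choice)."""
    _, local = attention_shift(saliency, steps)
    return global_perception(saliency) + local


if __name__ == "__main__":
    grid = [
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 5.0],
        [0.0, 0.0, 0.0, 0.0, 0.0],
    ]
    print(attention_shift(grid, steps=2))
    print(two_channel_score(grid))
```

Because the running state carries over between steps, the order in which regions are visited changes the result, which is the property a one-pass strategy lacks. In a trained network the regions, features, and recurrent update would presumably all be learned rather than hand-coded as they are here.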