Hiding a plane with a pixel: examining shape-bias in CNNs and the benefit of building in biological constraints.
Vision Research (IF 1.5). Pub Date: 2020-06-28. DOI: 10.1016/j.visres.2020.04.013
Gaurav Malhotra, Benjamin D Evans, Jeffrey S Bowers

When deep convolutional neural networks (CNNs) are trained “end-to-end” on raw data, some of the feature detectors they develop in their early layers resemble the representations found in early visual cortex. This result has been used to draw parallels between deep learning systems and human visual perception. In this study, we show that when CNNs are trained end-to-end they learn to classify images based on whatever feature is predictive of a category within the dataset. This can lead to bizarre results where CNNs learn idiosyncratic features such as high-frequency noise-like masks. In the extreme case, our results demonstrate image categorisation on the basis of a single pixel. Such features are extremely unlikely to play any role in human object recognition, where experiments have repeatedly shown a strong preference for shape. Through a series of empirical studies with standard high-performance CNNs, we show that these networks do not develop a shape-bias merely through regularisation methods or more ecologically plausible training regimes. These results raise doubts over the assumption that simply learning end-to-end in standard CNNs leads to the emergence of similar representations to the human visual system. In the second part of the paper, we show that CNNs are less reliant on these idiosyncratic features when we forgo end-to-end learning and introduce hard-wired Gabor filters designed to mimic early visual processing in V1.
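The abstract mentions two mechanisms that a short sketch can make concrete: a single pixel that predicts the category, and hard-wired Gabor filters standing in for V1. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names, the pixel-placement scheme, and all parameter values are hypothetical, and the Gabor kernel uses the standard textbook form rather than whatever parameterisation the paper adopts.

```python
import numpy as np


def add_diagnostic_pixel(image, label, num_classes, value=1.0):
    """Return a copy of a 2-D image with one pixel whose position encodes the label.

    Hypothetical scheme (an assumption, not taken from the paper):
    class k marks column k * (width // num_classes) on the top row.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    out = image.copy()  # leave the caller's image untouched
    width = out.shape[1]
    col = label * (width // num_classes)
    out[0, col] = value
    return out


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter: Gaussian envelope times a cosine carrier."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a centre pixel")
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier oscillates along direction theta.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier


def gabor_bank(size=11, wavelength=4.0, sigma=2.5, n_orientations=4):
    """Stack of kernels at evenly spaced orientations in [0, pi)."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return np.stack([gabor_kernel(size, wavelength, t, sigma) for t in thetas])
```

In a setup like the one the abstract describes, a bank of this kind would presumably serve as fixed, non-trainable weights for the first convolutional layer, so that only the later layers are learned. That wiring is left out here because the abstract does not specify it.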




Updated: 2020-06-28