Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion

International Journal of Computer Vision

Abstract

Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets), an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, and to further decompose object representations into individual parts and the objects' pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments on image classification and object detection using artificially occluded objects from the PASCAL3D+ and ImageNet datasets, as well as real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets built from several popular DCNN backbones (VGG-16, ResNet50, ResNeXt) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human-interpretable predictions, as their individual components can be understood as detecting parts and estimating an object's viewpoint.
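
To make the proposed head concrete, the following is a minimal PyTorch sketch of a compositional classification layer. It is an illustration under simplifying assumptions, not the authors' implementation (that is linked in the Notes below): the class name CompositionalHead and all hyperparameter names are hypothetical, the vMF-style part likelihood is reduced to cosine similarity against unit-norm kernels, and the occlusion model is collapsed into a single learned scalar that competes with the part evidence at every position.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CompositionalHead(nn.Module):
        """Hypothetical replacement for a DCNN's fully connected
        classifier: feature vectors are scored against unit-norm part
        kernels, per-class/per-pose spatial mixtures turn part evidence
        into class likelihoods, and a scalar occluder model explains
        away positions that no object part accounts for."""

        def __init__(self, feat_dim=512, num_kernels=64, num_classes=12,
                     num_mixtures=4, grid=7):
            super().__init__()
            # Dictionary of part kernels (normalized in forward()).
            self.kernels = nn.Parameter(torch.randn(num_kernels, feat_dim))
            # Spatial mixture coefficients: one map per class and pose mixture.
            self.mix = nn.Parameter(torch.randn(num_classes, num_mixtures,
                                                num_kernels, grid, grid))
            # Learned score of the background/occluder model.
            self.occ_bias = nn.Parameter(torch.zeros(1))

        def forward(self, feats):            # feats: (B, feat_dim, grid, grid)
            f = F.normalize(feats, dim=1)
            k = F.normalize(self.kernels, dim=1)
            # Cosine similarity stands in for the vMF log-likelihood.
            part_ll = torch.einsum('bchw,kc->bkhw', f, k)   # (B, K, H, W)
            mix = F.log_softmax(self.mix, dim=2)            # normalize over parts
            # Per-class, per-mixture, per-position object evidence.
            obj = torch.logsumexp(part_ll[:, None, None] + mix, dim=3)
            # Occluded positions are explained by the occluder model instead.
            obj = torch.maximum(obj, self.occ_bias)
            scores = obj.flatten(3).sum(-1)                 # sum over positions
            return scores.max(dim=2).values                 # best pose mixture

    # Usage: score a batch of backbone feature maps.
    head = CompositionalHead()
    logits = head(torch.randn(2, 512, 7, 7))    # -> (2, 12) class scores

Because each position contributes the better of its object-part evidence and the occluder score, occluded regions add a near-constant term instead of corrupting the class score, and taking the maximum over pose mixtures is what ties the representation to the object's viewpoint.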


Notes

  1. https://github.com/AdamKortylewski/CompositionalNets
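
The generative structure also yields the occluder localization described in the abstract essentially for free: a position can be flagged as occluded when the occluder model explains its feature vector better than any object part does. A minimal sketch, reusing the hypothetical part_ll and occ_bias quantities from the snippet above (the authors' actual localization procedure may differ; see the linked repository):

    def occlusion_map(part_ll, occ_bias):
        # part_ll: (B, K, H, W) part scores; occ_bias: learned occluder score.
        best_part = part_ll.max(dim=1).values   # strongest object-part evidence
        return (occ_bias > best_part).float()   # 1.0 where the occluder model wins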


Acknowledgements

Funding was provided by the Swiss National Science Foundation (P2BSP2.181713) and the Office of Naval Research (Grant Nos. N00014-18-1-2119 and N00014-20-1-2206).

Author information


Corresponding author

Correspondence to Adam Kortylewski.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kortylewski, A., Liu, Q., Wang, A. et al. Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion. Int J Comput Vis 129, 736–760 (2021). https://doi.org/10.1007/s11263-020-01401-3

