Pointly-supervised scene parsing with uncertainty mixture

https://doi.org/10.1016/j.cviu.2020.103040

Highlights

  • We identify the existence of a statistical phenomenon called uncertainty mixture.

  • We propose a threshold-free method to harvest pseudo labels.

  • We contribute a novel regularized Gamma mixture model.

  • We achieve state-of-the-art results on PascalContext and ADE20k.

Abstract

Pointly-supervised learning is an important topic for scene parsing, as dense annotation is extremely expensive and hard to scale. The state-of-the-art method harvests pseudo labels by applying thresholds to softmax outputs (logits). There are two issues with this practice: (1) softmax outputs do not necessarily reflect the confidence of the network's predictions; (2) there is no principled way to decide on the optimal threshold, and tuning thresholds can be time-consuming for deep neural networks. Our method, by contrast, builds upon uncertainty measures instead of logits and is free of threshold tuning. We motivate the method with a large-scale analysis of the distribution of uncertainty measures, using strong models and challenging databases. This analysis leads to the discovery of a statistical phenomenon called uncertainty mixture. Specifically speaking, for each independent category, the distribution of uncertainty measures for unlabeled points is a mixture of two components (certain vs. uncertain samples). The phenomenon of uncertainty mixture is surprisingly ubiquitous in real-world datasets like PascalContext and ADE20k. Inspired by this discovery, we propose to decompose the distribution of uncertainty measures with a Gamma mixture model, leading to a principled method to harvest reliable pseudo labels. Beyond that, we assume the uncertainty measures for labeled points are always drawn from the certain component. This amounts to a regularized Gamma mixture model. We provide a thorough theoretical analysis of this model, showing that it can be solved with an EM-style algorithm with a convergence guarantee. Our method is also empirically successful. On PascalContext and ADE20k, we achieve clear margins over the baseline, notably with no threshold tuning in the pseudo label generation procedure. On the absolute scale, since our method collaborates well with strong baselines, we reach new state-of-the-art performance on both datasets.

Introduction

Dense annotation for scene parsing is very expensive. According to Mottaghi et al. (2014), it took the authors 3 to 5 min to annotate one PASCAL (Everingham et al., 2010) image with dense scene segments. Caesar et al. (2018) exploit superpixels to speed up the annotation of COCO (Lin et al., 2014) images, while still taking 3 min per image on average. Annotating scenes with point clicks and semantic class assignments is thus an appealing alternative, and training with such labels is called pointly-supervised scene parsing in this paper.1 Specifically speaking, full supervision corresponds to the bottom-right parts of each panel in Fig. 1, and point supervision is enlarged and overlaid onto the input images, as the top-left parts demonstrate. This setting is quite challenging because more than 99% of the regions in the images are not annotated. Besides, training with few samples is an exciting topic, which may help us understand the nature of modern models. For example, we identify a previously unknown statistical phenomenon called uncertainty mixture, which is an intriguing property of deep neural networks. In summary, the problem of pointly-supervised scene parsing is (1) urgent, (2) difficult and (3) potentially fruitful.

This work focuses on harvesting pseudo labels for the pointly-supervised training set. Specifically speaking, one can train a model using only the point supervision, which is referred to as the first round model. This model produces a semantic label prediction for every pixel in the training set, as demonstrated in the top-right parts of each panel in Fig. 1. These predictions are referred to as pseudo labels throughout this paper. They are erroneous, but contain correct ones as well. The central problem considered here is how to harvest as many good pseudo labels as possible for training. The state-of-the-art method (Qian et al., 2019) expands point supervision into regions and regards them as trustworthy pseudo labels. It is proven effective on public benchmarks, yet there exist two critical issues:

Firstly, it relies upon pixel-wise softmax outputs, also called logits. It is known that logits can be over-confident about wrong predictions (Li and Hoiem, 2018). The reason is that the softmax output is only a single point estimate of the predictive distribution. Instead, we compute the uncertainty measures for these predictions, depicted in the bottom-left parts of each panel in Fig. 1. They faithfully reflect the confidence of the network outputs. It is obvious that harvesting pseudo labels with low uncertainty (low color temperature in Fig. 1) is a promising solution. However, how do we properly define 'low uncertainty'?
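To make the notion of an uncertainty measure concrete, the following sketch illustrates Monte Carlo dropout in the spirit of Gal and Ghahramani (2016), on which dropout-based uncertainty estimation rests. Here `predict_fn` is a hypothetical stochastic forward pass (the network run with dropout active at test time), and the variance of the winning class's probability is one simple per-pixel measure, not necessarily the exact measure used in this paper.

```python
import numpy as np

def mc_dropout_uncertainty(predict_fn, image, num_passes=20):
    """Monte Carlo dropout: run the network `num_passes` times with
    dropout active at test time and measure how much the prediction
    fluctuates. `predict_fn` is a placeholder for the stochastic
    forward pass; it returns per-pixel class probabilities (H, W, C)."""
    probs = np.stack([predict_fn(image) for _ in range(num_passes)])
    mean_probs = probs.mean(axis=0)       # predictive mean, (H, W, C)
    winner = mean_probs.argmax(axis=-1)   # predicted labels, (H, W)
    h, w = winner.shape
    # Variance of the winning class's probability across passes:
    # one simple per-pixel uncertainty measure (low = confident).
    winner_probs = probs[:, np.arange(h)[:, None], np.arange(w), winner]
    return winner, winner_probs.var(axis=0)
```

A deterministic `predict_fn` yields zero uncertainty everywhere; any disagreement between dropout passes shows up as positive variance.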

Secondly, harvesting pseudo labels using logits introduces thresholds. It is very time-consuming to tune thresholds for modern deep networks. Meanwhile, using the natural thresholds generated by argmax would lead to a trivial usage of the pseudo labels. One may argue that using uncertainty measures still involves defining a low-uncertainty threshold, as mentioned above. We show that this problem can be resolved by exploiting a newly discovered statistical phenomenon called uncertainty mixture. It allows us to decide on the optimal threshold of uncertainty measures in an automatic way. Specifically speaking, we decompose the uncertainty measures for unlabeled points with a Gamma mixture and harvest pseudo labels belonging to the certain component.
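The decomposition-and-harvest step can be sketched with a minimal two-component Gamma mixture. This is a simplified illustration, not the paper's exact solver: the M-step below uses weighted moment matching for the Gamma shape and scale, and assigning a sample to the lower-mean component when its posterior exceeds 0.5 is one simple rule that needs no hand-tuned threshold on the uncertainty values themselves.

```python
import numpy as np
from scipy.stats import gamma

def fit_gamma_mixture(u, num_iters=50):
    """EM-style fit of a two-component Gamma mixture to uncertainty
    measures `u`. The M-step uses weighted moment matching for the
    Gamma shape/scale, a common simplification."""
    u = np.asarray(u, dtype=float)
    # Initialize responsibilities by splitting at the median.
    hi = (u > np.median(u)).astype(float)
    resp = np.stack([1.0 - hi, hi], axis=1)          # (N, 2)
    params = [(1.0, 1.0), (1.0, 1.0)]
    for _ in range(num_iters):
        pi = resp.mean(axis=0)                       # mixing weights
        params = []
        for k in range(2):
            w = resp[:, k] / resp[:, k].sum()
            mean = float(np.sum(w * u))
            var = float(np.sum(w * (u - mean) ** 2))
            shape = mean ** 2 / max(var, 1e-12)      # moment matching
            scale = var / max(mean, 1e-12)
            params.append((shape, scale))
        # E-step: posterior probability of each component per sample.
        pdfs = np.stack(
            [pi[k] * gamma.pdf(u, params[k][0], scale=params[k][1])
             for k in range(2)], axis=1)
        resp = pdfs / np.clip(pdfs.sum(axis=1, keepdims=True), 1e-300, None)
    return params, resp

def harvest(params, resp):
    """Keep samples whose posterior favors the lower-mean (certain)
    component -- no hand-tuned threshold on the uncertainty itself."""
    certain = int(np.argmin([s * sc for s, sc in params]))
    return resp[:, certain] > 0.5
```

On synthetic data drawn from two well-separated Gamma components, the fit recovers the split and `harvest` selects essentially only the low-uncertainty samples.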

Beyond the direct application of the Gamma mixture, a new regularized model tailored for our problem is proposed, analyzed, implemented and evaluated. We assume the uncertainty measures for labeled points are drawn from the certain component. Intuitively, this helps the mixture model to better capture the shape of the certain component. Mathematically, this amounts to an added regularization term in the objective function of an EM procedure. Since the model has not been studied before, we present a systematic exposition of its analytical properties, from convergence guarantee to solver details. Empirically, this regularized Gamma mixture harvests pseudo labels of higher quality than the baseline and leads to better scene parsing performance, in all experimental settings we inspected.
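A minimal sketch of this regularization idea follows, under the assumption that labeled-point uncertainties enter the certain component's M-step with responsibility fixed to 1 and a hypothetical weight `lam`; the paper's exact objective, regularization weight and update equations may differ, and the moment-matching M-step is again a simplification.

```python
import numpy as np
from scipy.stats import gamma

def fit_regularized_gamma_mixture(u, u_lab, lam=1.0, num_iters=50):
    """EM-style fit of a two-component Gamma mixture to unlabeled
    uncertainties `u`, regularized by labeled-point uncertainties
    `u_lab`: labeled points are assumed certain, so they contribute
    to the certain component's M-step with fixed weight `lam`."""
    u, u_lab = np.asarray(u, float), np.asarray(u_lab, float)
    hi = (u > np.median(u)).astype(float)
    resp = np.stack([1.0 - hi, hi], axis=1)   # component 0 = certain
    params = [(1.0, 1.0), (1.0, 1.0)]
    for _ in range(num_iters):
        pi = resp.mean(axis=0)
        params = []
        for k in range(2):
            x, w = u, resp[:, k]
            if k == 0:
                # Regularization: labeled points always count as certain.
                x = np.concatenate([u, u_lab])
                w = np.concatenate([w, lam * np.ones_like(u_lab)])
            w = w / w.sum()
            mean = float(np.sum(w * x))
            var = float(np.sum(w * (x - mean) ** 2))
            params.append((mean ** 2 / max(var, 1e-12),
                           var / max(mean, 1e-12)))
        pdfs = np.stack(
            [pi[k] * gamma.pdf(u, params[k][0], scale=params[k][1])
             for k in range(2)], axis=1)
        resp = pdfs / np.clip(pdfs.sum(axis=1, keepdims=True), 1e-300, None)
    return params, resp
```

Because the labeled points anchor the certain component, its estimated shape tracks the low-uncertainty mode even when the unlabeled data alone would blur the two components.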

Last but not least, our method is extensively benchmarked on challenging public datasets, namely PascalContext and ADE20k. It turns out that our method works robustly in various settings. This robustness is attributed to the nature of the method: our mixture modeling decides on the optimal threshold of uncertainty measures automatically. On an absolute scale, our method collaborates well with strong network architectures and training techniques, resulting in new state-of-the-art performance on both datasets. We believe our solution to be a useful one due to its robustness and good performance.

In summary, we claim four contributions for this study, in descending order of generality and importance:

  • While uncertainty in deep learning is drawing more attention recently, understanding of its behavior in large-scale problems is still sparse. We present an analysis in the field of pointly-supervised scene parsing. The existence of a previously unknown statistical phenomenon called uncertainty mixture is revealed. By existence, we mean: (1) it is ubiquitously observed in real-world datasets; (2) it is understandable considering the nature of uncertainty; (3) the empirical success of our method (inspired by the phenomenon) retrospectively supports its existence.

  • Encouraged by the existence of uncertainty mixture, a new methodology is proposed: to decompose the distribution of uncertainty measures into certain and uncertain components and harvest pseudo labels belonging to the certain component. Under this perspective, those two critical issues mentioned above are resolved in a principled manner.

  • A novel mixture model is introduced, which is tightly related to a unique feature of the problem considered. Making full use of labeled points, the uncertainty measures for them are formulated as a regularization term in the objective function. A detailed analytical exposition is provided, covering important aspects of this new model.

  • Extensive evaluations on public benchmarks are conducted and analyzed. With ablative studies, we validate the effectiveness of the key ingredients of our method. On an absolute scale, we achieve new state-of-the-art results on PascalContext and ADE20k. This method won the pointly-supervised scene parsing track of the CVPR 2020 challenge on learning from imperfect data.

Section snippets

Mixture models in computer vision

Finite mixture models have been long regarded as a powerful tool for computer vision problems. Stauffer and Grimson (1999) model pixel intensities with Gaussian mixtures, so as to distinguish between foreground and background pixels in streams captured by static cameras. Jepson and Black (1993) consider the motion of a patch as a finite mixture, in order to cover the multi-direction optical flow caused by occlusion and transparency. Straub et al. (2014) introduce the mixture model of Manhattan

Preliminaries

For clarity, we give necessary explanations for notations and dropout-based uncertainty measure in this section.

Uncertainty mixture

In this section, we demonstrate the existence of uncertainty mixture with diverse results in real-world datasets.

Pointly-supervised scene parsing

In this section, we describe how to exploit uncertainty mixture for better pointly-supervised scene parsing.

Regularized Gamma mixture model

In short, our regularized model incorporates the uncertainty measures for labeled points into the formulation. Similar to the generation of the category-wise vector Uj, we collect the uncertainty measures for labeled points (of category j) as Ūj. The task remains the same: automatically estimating the parameters {θ1j, θ2j}. The core assumption is that Ūj is drawn from the certain component. Empirically, Ūj samples have an uncertainty measure marginally larger than zero. The reason has been discussed in

Protocol

As mentioned above, all our evaluations are done with PSPNets (Zhao et al., 2017b) with DRN backbones (Yu et al., 2017). We consider three backbones of different capacities: 22-layer DRN, 54-layer DRN and 105-layer DRN. We report results on two representative benchmarks, PascalContext (Mottaghi et al., 2014) and ADE20k (Zhou et al., 2017), using the standard splits. For PascalContext, 4998 samples are used for training and 5105 samples are used for testing. For ADE20k, 20210 images are used for

Conclusions

We study the problem of pointly-supervised scene parsing in this paper and make four contributions to the community. Firstly, we conduct a large-scale statistical analysis of the category-wise uncertainty measure for this setting. The major outcome of this analysis is the discovery of a phenomenon called uncertainty mixture. We believe it is of interest to many researchers as it reveals an intriguing property of deep models. Secondly, inspired by the phenomenon, we propose a pipeline that

CRediT authorship contribution statement

Hao Zhao: Proposing the idea, Designing/conducting experiments, Theoretical analysis, Writing - original draft. Ming Lu: Conducting experiments. Anbang Yao: Writing - review & editing. Yiwen Guo: Theoretical analysis. Yurong Chen: Supervision, Resources. Li Zhang: Supervision, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was jointly supported by National Natural Science Foundation of China (Grant No. 61132007, 61172125, 61601021, and U1533132).

References (46)

  • Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L., 2016. What’s the point: Semantic segmentation with point...
  • Caesar, H., Uijlings, J., Ferrari, V., 2018. Coco-stuff: Thing and stuff classes in context, in: CVPR...
  • Chan, A.B., Vasconcelos, N., 2008. Modeling, clustering, and segmenting video with mixtures of dynamic textures, in:...
  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. Deeplab: Semantic image segmentation with...
  • Dai, J., He, K., Sun, J., 2015. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic...
  • Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc)...
  • Farabet, C., Couprie, C., Najman, L., LeCun, Y., 2012. Learning hierarchical features for scene labeling, in: TPAMI...
  • Gal, Y., Ghahramani, Z., 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning,...
  • He, X., Zemel, R.S., Carreira-Perpiñán, M.Á., 2004. Multiscale conditional random fields for image labeling, in: CVPR...
  • Hoiem, D., Efros, A.A., Hebert, M., 2005. Geometric context from a single image, in: ICCV...
  • Jena, R., Awate, S.P., 2019. A bayesian neural net to segment images with uncertainty estimates and good calibration,...
  • Jepson, A., Black, M.J., 1993. Mixture models for optical flow computation, in: CVPR...
  • Kendall, A., et al., 2015. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding...
  • Kendall, A., Cipolla, R., 2016. Modelling uncertainty in deep learning for camera relocalization, in: ICRA...
  • Khan, S., Hayat, M., Zamir, S.W., Shen, J., Shao, L., 2019. Striking the right balance with uncertainty, in: CVPR...
  • Kohl, S., Romera-Paredes, B., Meyer, C., De Fauw, J., Ledsam, J.R., Maier-Hein, K., Eslami, S.A., Rezende, D.J.,...
  • Kolesnikov, A., Lampert, C.H., 2016. Seed, expand and constrain: Three principles for weakly-supervised image...
  • Lakshminarayanan, B., Pritzel, A., Blundell, C., 2017. Simple and scalable predictive uncertainty estimation using deep...
  • Li, Z., Hoiem, D., 2018. Reducing over-confident errors outside the known distribution...
  • Lin, D., Dai, J., Jia, J., He, K., Sun, J., 2016. Scribblesup: Scribble-supervised convolutional networks for semantic...
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft...
  • Liu, C., Yuen, J., Torralba, A., 2011. Nonparametric scene parsing via label transfer, in: TPAMI...
  • Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: CVPR...