
Distribution-Aware Margin Calibration for Semantic Segmentation in Images


Abstract

The Jaccard index, also known as Intersection-over-Union (IoU), is one of the most critical evaluation metrics in image semantic segmentation. However, directly optimizing the IoU score is very difficult, because the learning objective is neither differentiable nor decomposable. Although several algorithms have been proposed to optimize its surrogates, no guarantee is provided for their generalization ability. In this paper, we propose a margin calibration method that can be directly used as a learning objective for improved generalization of IoU over the data distribution, underpinned by a rigorous lower bound. This scheme theoretically ensures better segmentation performance in terms of IoU score. We evaluate the effectiveness of the proposed margin calibration method on seven image datasets, showing substantial improvements in IoU score over other learning objectives with deep segmentation models.
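To make the idea concrete, the following is a minimal PyTorch sketch of a generic per-class margin-calibrated softmax loss, in the spirit of the distribution-aware margins of Cao et al. (2019). It is an illustrative assumption rather than the exact objective derived in this paper: the class MarginCalibratedLoss, the margin schedule based on per-class pixel counts, and all constants are hypothetical. The official implementation is linked in the Notes below.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MarginCalibratedLoss(nn.Module):
    """Generic per-class margin-calibrated softmax loss (illustrative sketch only)."""

    def __init__(self, margins: torch.Tensor):
        super().__init__()
        # margins: shape (K,), one non-negative margin per class,
        # typically larger for rarer classes.
        self.register_buffer("margins", margins)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (B, K, H, W) raw scores; targets: (B, H, W) class indices in [0, K).
        num_classes = logits.size(1)
        flat_logits = logits.permute(0, 2, 3, 1).reshape(-1, num_classes)
        flat_targets = targets.reshape(-1)
        # Subtract the class-dependent margin from the true-class logit, so a larger
        # margin must be overcome before a pixel of that class is scored correctly.
        adjusted = flat_logits.clone()
        idx = torch.arange(flat_logits.size(0), device=flat_logits.device)
        adjusted[idx, flat_targets] -= self.margins[flat_targets]
        return F.cross_entropy(adjusted, flat_targets)


# Hypothetical usage, with margins shrinking as a class gets more pixels,
# e.g. proportional to N_k^(-1/4) as in Cao et al. (2019):
# counts = per-class pixel counts over the training set, a (K,) float tensor
# margins = 0.5 * counts.pow(-0.25) / counts.pow(-0.25).max()
# criterion = MarginCalibratedLoss(margins)
# loss = criterion(model(images), labels)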


Notes

  1. See our PyTorch implementation at https://github.com/yutao1008/margin_calibration for more details.

  2. https://www.cityscapes-dataset.com/method-details/?submissionID=10089

References

  • Abraham, N., & Khan, N. M. (2019). A novel focal tversky loss function with improved attention u-net for lesion segmentation. In: ISBI, pp. 683–687.

  • Ahmed, F., Tarlow, D., & Batra, D. (2015). Optimizing expected intersection-over-union with candidate-constrained crfs. In: ICCV, pp. 1850–1858.

  • Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al. (2017). Robotic instrument segmentation challenge. CoRR.

  • Berman, M., Rannen Triki, A., Blaschko, M.B. (2018). The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR, pp. 4413–4421.

  • Blaschko, M. B., & Lampert, C. H. (2008). Learning to localize objects with structured output regression. In: ECCV, pp. 2–15.

  • Boser, B.E., Guyon, I.M., Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual workshop on Computational learning theory, pp. 144–152.

  • Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. In: ICRA, pp. 2639–2645.

  • Caesar, H., Uijlings, J., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In: CVPR, pp. 1209–1218.

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In: NIPS, pp. 1567–1578.

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.


  • Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV, pp. 801–818.

  • Cheng, B., Chen, L. C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T. S., Hwu, W. M., & Shi, H. (2019). Spgnet: Semantic prediction guidance for scene parsing. In: ICCV, pp. 5218–5228.

  • Choi, S., Kim, J. T., & Choo, J. (2020). Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In: CVPR, pp. 9373–9383.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223.

  • Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2018). Context contrasted feature and gated multi-scale aggregation for scene segmentation. In: CVPR, pp. 2393–2402.

  • Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2020). Semantic segmentation with context encoding and multi-path decoding. IEEE Transactions on Image Processing, 29, 3520–3533.


  • Eelbode, T., Bertels, J., Berman, M., Vandermeulen, D., Maes, F., Bisschops, R., & Blaschko, M. B. (2020). Optimization for medical image segmentation: Theory and practice when evaluating with dice score or jaccard index. IEEE Transactions on Medical Imaging, 39(11), 3679–3690.


  • Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.


  • Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In: CVPR, pp. 3146–3154.

  • Grabocka, J., Scholz, R., Schmidt-Thieme, L. (2019). Learning surrogate losses. CoRR.

  • Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., & Malik, J. (2011). Semantic contours from inverse detectors. In: ICCV, pp. 991–998.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR, pp. 770–778.

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In: CVPR, pp. 7132–7141.

  • Karimi, D., & Salcudean, S. E. (2019). Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on Medical Imaging, 39(2), 499–513.


  • Ke, T., Hwang, J., Liu, Z., & Yu, S. (2018). Adaptive affinity fields for semantic segmentation. In: ECCV, pp. 587–602.

  • Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., & Ayed, I. B. (2019). Boundary loss for highly unbalanced segmentation. In: MIDL, pp. 285–296.

  • Khan, S., Hayat, M., Zamir, S. W., Shen, J., & Shao, L. (2019). Striking the right balance with uncertainty. In: CVPR, pp. 103–112.

  • Li, Y., Zaragoza, H., Herbrich, R., Shawe-Taylor, J., & Kandola, J. (2002). The perceptron algorithm with uneven margins. In: ICML, pp. 379–386.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In: ICCV, pp. 2980–2988.

  • Liu, X., Wang, Y., Wang, L., et al. (2019). Mcdiarmid-type inequalities for graph-dependent variables and stability bounds. In: NIPS, pp. 10890–10901.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440.

  • Loshchilov, I., Hutter, F. (2019). Decoupled weight decay regularization. In: ICLR.

  • Ma, J., Chen, J., Ng, M., Huang, R., Li, Y., Li, C., Yang, X., & Martel, A. L. (2021). Loss odyssey in medical image segmentation. Medical Image Analysis, 71, 102035.


  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. London: MIT Press.


  • Nagendar, G., Singh, D., Balasubramanian, V.N., Jawahar, C. (2018). Neuro-iou: Learning a surrogate loss for semantic segmentation. In: BMVC, p. 278.

  • Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In: CVPR, pp. 4990–4999.

  • Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., Srebro, N. (2018). The role of over-parametrization in generalization of neural networks. In: ICLR.

  • Nowozin, S. (2014). Optimal decisions from probabilistic models: The intersection-over-union case. In: CVPR, pp. 548–555.

  • Rahman, M. A., & Wang, Y. (2016). Optimizing intersection-over-union in deep neural networks for image segmentation. In: International symposium on visual computing, pp. 234–244.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  • Salehi, S. S. M., Erdogmus, D., & Gholipour, A. (2017). Tversky loss function for image segmentation using 3d fully convolutional deep networks. In: International Workshop on Machine Learning in Medical Imaging, pp. 379–387.

  • Shen, D., Ji, Y., Li, P., Wang, Y., Lin, D. (2020). Ranet: Region attention network for semantic segmentation. In: NIPS.

  • Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., & Cardoso, M. J. (2017). Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248.

  • Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In: CVPR, pp. 843–852.

  • Wang, G., Liu, X., Li, C., Xu, Z., Ruan, J., Zhu, H., Meng, T., Li, K., Huang, N., & Zhang, S. (2020). A noise-robust framework for automatic segmentation of covid-19 pneumonia lesions from ct images. IEEE Transactions on Medical Imaging, 39(8), 2653–2663.


  • Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.


  • Wang, L., Li, D., Zhu, Y., Tian, L., & Shan, Y. (2020). Dual super-resolution learning for semantic segmentation. In: CVPR, pp. 3774–3783.

  • Wong, K. C., Moradi, M., Tang, H., & Syeda-Mahmood, T. (2018). 3d segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: MICCAI, pp. 612–619.

  • Xiao, J., & Quan, L. (2009). Multiple view semantic segmentation for street view images. In: ICCV, pp. 686–693.

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In: ECCV, pp. 418–434.

  • Xu, D., Ouyang, W., Wang, X., & Sebe, N. (2018). Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: CVPR, pp. 675–684.

  • Xuhong, L., Grandvalet, Y., & Davoine, F. (2018). Explicit inductive bias for transfer learning with convolutional networks. In: ICML, pp. 2825–2834.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: CVPR, pp. 2636–2645.

  • Yu, F., Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. CoRR.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: CVPR, pp. 2881–2890.

  • Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., & Jia, J. (2018). Psanet: Point-wise spatial attention network for scene parsing. In: ECCV, pp. 267–283.

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In: CVPR, pp. 633–641.

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3), 302–321.



Acknowledgements

This research work is supported by the Multimedia Data Analytics Lab of the Global Big Data Technologies Centre (GBDTC), University of Technology Sydney (UTS). The experimental environment is provided by UTS Tech Lab. Zhibin Li acknowledges the support from UTS to conduct this research and the continued support from CSIRO’s Machine Learning and Artificial Intelligence Future Science Platform.

Author information


Corresponding author

Correspondence to Jian Zhang.

Additional information

Communicated by Zhouchen Lin. Litao Yu and Zhibin Li contributed equally to this work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Appendix

A.1 Proof of Theorem 1

We first prove that, with probability \(1-\frac{\eta }{K}\), the following inequality holds:

(19)

Assume the following inequality holds for non-negative \(\epsilon _{k0}\) and \(\epsilon _{0k}\):

(20)

Solving the above inequality, we can get:

$$\begin{aligned} \epsilon _k = (\frac{a_k}{b_k}\epsilon _{0k}+\epsilon _{k0})(b_k-\epsilon _{0k})^{-1}, \end{aligned}$$
(21)

where and .

Next, we need to determine the values of \(\epsilon _{k0}\) and \(\epsilon _{0k}\) used to calculate \(\epsilon _k\); they should satisfy the following inequality:

(22)

so we can simply substitute (22) into (20) to complete the proof.

The empirical label distribution \(P_k\) does not depend on the model \(\theta \), and we assume it is an accurate estimate of the label distribution \(\mathcal {D}_{\mathcal {Y}}\), i.e.,

(23)

Based on the above approximation, a sufficient condition for (22) regarding \(\epsilon _{0k}\) and \(\epsilon _{k0}\) is:

(24)

Following the margin-based generalization bound in Mohri et al. (2018, Theorem 9.2), for the \(N_k\) foreground-class pixels, with probability at least \(1-\frac{\eta }{2K}\), we have:

(25)

where \(\mathfrak {R}_{N_k}(\Theta )\) is the Rademacher complexity of the hypothesis class \(\Theta \) over the \(N_k\) pixels of the k-th foreground class. For an input data batch with M pixels, we first apply McDiarmid’s inequality for M-dependent data (Liu et al. 2019) to the proof of Mohri et al. (2018, Theorem 3.3), and then use the result in the proof of Mohri et al. (2018, Theorem 9.2) to obtain the formulation of (25).

The Rademacher complexity \(\mathfrak {R}_{N_k}(\Theta )\) typically scales as \(\sqrt{\frac{C(\Theta )}{N_k}}\), with \(C(\Theta )\) being a proper complexity measure of \(\Theta \) (Neyshabur et al. 2018); such a scaling has also been used in related work (see Cao et al. 2019 and the references therein). We can then rewrite (25) as:

(26)

where \(\sigma (\frac{1}{\eta }) \triangleq \frac{\rho _{\max }}{4K} \sqrt{2M\log \frac{2K}{\eta }}\) is typically a low-order term in \(\frac{1}{\eta }\).

Similarly, for the \(N-N_k\) pixels of the background class, with probability at least \(1-\frac{\eta }{2K}\), we have:

(27)

We then combine (26), (27), and (24), and take a union bound over \(\epsilon _{k0}\) and \(\epsilon _{0k}\) to get the following expressions, with which (24) holds with probability at least \(1-\eta /K\):

$$\begin{aligned} \epsilon _{k0}&= \frac{\sqrt{N_k}}{N} \frac{4K}{\rho _{k0}}\mathcal {F},\nonumber \\ \epsilon _{0k}&= \frac{\sqrt{N-N_k}}{N}\frac{4K}{\rho _{0k}} \mathcal {F}. \end{aligned}$$
(28)

We then substitute the above expressions into (21). Letting \(\mu _k = \frac{\rho _{k0}}{\rho _{0k}}\), we have:

$$\begin{aligned} \epsilon _k = \frac{\frac{a_k}{b_k}\sqrt{N-N_k} + \frac{\sqrt{N_k}}{\mu _k}}{\frac{b_k N}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}, \end{aligned}$$
(29)

so that inequality (13) holds with probability at least \(1-\eta /K\).
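To make the substitution explicit, plugging (28) into (21) and multiplying the numerator and denominator by \(\frac{N\rho _{0k}}{4K\mathcal {F}}\) gives

$$\begin{aligned} \epsilon _k = \frac{\frac{a_k}{b_k}\frac{\sqrt{N-N_k}}{N}\frac{4K\mathcal {F}}{\rho _{0k}} + \frac{\sqrt{N_k}}{N}\frac{4K\mathcal {F}}{\rho _{k0}}}{b_k - \frac{\sqrt{N-N_k}}{N}\frac{4K\mathcal {F}}{\rho _{0k}}} = \frac{\frac{a_k}{b_k}\sqrt{N-N_k} + \frac{\rho _{0k}}{\rho _{k0}}\sqrt{N_k}}{\frac{b_k N}{4K\mathcal {F}}\rho _{0k} - \sqrt{N-N_k}}, \end{aligned}$$

which is exactly (29), since \(\frac{\rho _{0k}}{\rho _{k0}} = \frac{1}{\mu _k}\).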

In practice, we do not know the values of \(a_k\) and \(b_k\), so Eq. (29) has limited direct use. However, since \(\frac{a_k}{b_k}\le 1\), we can obtain a very useful bound:

$$\begin{aligned} \epsilon _k \le \frac{\sqrt{N-N_k} + \frac{\sqrt{N_k}}{\mu _k}}{\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}. \end{aligned}$$
(30)

Taking a union bound of Eq. (13) over all classes and averaging, we obtain the following inequality with probability at least \(1-\eta \):

(31)

with

$$\begin{aligned} \epsilon = \frac{1}{K}\sum \limits _{k=1}^{K} \frac{ \sqrt{N-N_k}+ \frac{\sqrt{N_k}}{\mu _k}}{\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}, \end{aligned}$$
(32)

which completes the proof.

A.2 Proof of Corollary 1

Substituting \(\mu _k\) in (32) with \(\frac{\sqrt{N_k}}{r(N/N_k\!-\!1)-\sqrt{N\!-\!N_k}}\), where \(r\) is a hyper-parameter, we get:

$$\begin{aligned} \epsilon&= \frac{1}{K}\sum _{k=1}^{K} (\frac{r(N-N_k)}{N_k})(\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k})^{-1} \nonumber \\&= \frac{1}{K}\sum _{k=1}^{K} (\frac{r(N-N_k)}{N_k^2})(\frac{1}{4K\mathcal {F}}\rho _{0k}-\frac{\sqrt{N-N_k}}{N_k})^{-1}. \end{aligned}$$
(33)
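To spell out the cancellation behind the first equality: with this choice of \(\mu _k\),

$$\begin{aligned} \frac{\sqrt{N_k}}{\mu _k} = r\Big (\frac{N}{N_k}-1\Big ) - \sqrt{N-N_k} = \frac{r(N-N_k)}{N_k} - \sqrt{N-N_k}, \end{aligned}$$

so the numerator \(\sqrt{N-N_k}+\frac{\sqrt{N_k}}{\mu _k}\) of each summand in (32) reduces to \(\frac{r(N-N_k)}{N_k}\); the second equality then follows by dividing the numerator and denominator by \(N_k\).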

Let \(x_k = \frac{r(N-N_k)}{N_k^2} \) and \(y_k = \frac{1}{4K\mathcal {F}}\rho _{0k}-\frac{\sqrt{N-N_k}}{N_k}\). According to the Cauchy-Schwarz inequality, we have:

$$\begin{aligned} \left( \sum _{k=1}^{K} \sqrt{\frac{x_k}{y_k}}\cdot \sqrt{y_k}\right) ^2 \le (\sum _{k=1}^{K}\frac{x_k}{y_k}) (\sum _{k=1}^{K} y_k), \end{aligned}$$
(34)

so that

$$\begin{aligned} \epsilon&\ge \frac{1}{K}\cdot \left( \sum \limits _{k=1}^{K} \sqrt{x_k}\right) ^2 (\sum \limits _{k=1}^{K} y_k)^{-1} \nonumber \\&= \frac{r}{K}\cdot \frac{\left( \sum \limits _{k=1}^{K} \frac{\sqrt{N-N_k}}{N_k}\right) ^2}{\frac{1}{4K\mathcal {F}}\sum \limits _{k=1}^{K} \rho _{0k}- \sum \limits _{k=1}^{K}\frac{\sqrt{N-N_k}}{N_k}}. \end{aligned}$$
(35)
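The second line of (35) uses

$$\begin{aligned} \sqrt{x_k} = \sqrt{r}\,\frac{\sqrt{N-N_k}}{N_k} \quad \text {and}\quad \sum \limits _{k=1}^{K} y_k = \frac{1}{4K\mathcal {F}}\sum \limits _{k=1}^{K}\rho _{0k} - \sum \limits _{k=1}^{K}\frac{\sqrt{N-N_k}}{N_k}. \end{aligned}$$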

The right-hand side of (35) is a constant, because \(r\) is a given hyper-parameter and we assume \(\sum \limits _{k=1}^{K} \rho _{0k}\) is a constant. The equality holds when \(\frac{\sqrt{x_1}}{y_1}=\ldots =\frac{\sqrt{x_K}}{y_K}\), which yields Corollary 1.

Note that here \(\mu _k = \frac{\sqrt{N_k}}{r(N/N_k-1)-\sqrt{N-N_k}}\), while in Corollary 1, \(\mu _k = \frac{P_k\sqrt{N_k}}{\upsilon (N-N_k)-P_k\sqrt{N-N_k}}\). These two conditions are essentially equivalent when \(r\) and \(\upsilon \) are hyper-parameters: simply let \(r=N\upsilon \) and notice that \(P_k=\frac{N_k}{N}\), as the following calculation shows.
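Concretely, with \(P_k = \frac{N_k}{N}\) and \(r = N\upsilon \),

$$\begin{aligned} \frac{P_k\sqrt{N_k}}{\upsilon (N-N_k)-P_k\sqrt{N-N_k}} = \frac{\sqrt{N_k}}{\frac{N\upsilon (N-N_k)}{N_k}-\sqrt{N-N_k}} = \frac{\sqrt{N_k}}{r(N/N_k-1)-\sqrt{N-N_k}}, \end{aligned}$$

where the first equality multiplies the numerator and denominator by \(\frac{N}{N_k}\).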


Cite this article

Yu, L., Li, Z., Xu, M. et al. Distribution-Aware Margin Calibration for Semantic Segmentation in Images. Int J Comput Vis 130, 95–110 (2022). https://doi.org/10.1007/s11263-021-01533-0
