
Distribution-Aware Margin Calibration for Semantic Segmentation in Images


Abstract

The Jaccard index, also known as Intersection-over-Union (IoU), is one of the most critical evaluation metrics in image semantic segmentation. However, directly optimizing the IoU score is very difficult, because the learning objective is neither differentiable nor decomposable. Although several algorithms have been proposed to optimize its surrogates, no guarantee is provided for their generalization ability. In this paper, we propose a margin calibration method that can be directly used as a learning objective for improved generalization of IoU over the data distribution, underpinned by a rigorous lower bound. This scheme theoretically ensures better segmentation performance in terms of IoU score. We evaluate the effectiveness of the proposed margin calibration method on seven image datasets, showing substantial improvements in IoU score over other learning objectives with deep segmentation models.
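To make the idea concrete, the following is a minimal PyTorch sketch of a generic per-class margin-calibrated softmax loss, in the spirit of the distribution-aware margins of Cao et al. (2019). It is an illustrative assumption rather than the exact objective derived in this paper: the class MarginCalibratedLoss, the margin schedule based on per-class pixel counts, and all constants are hypothetical. The official implementation is linked in the Notes below.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MarginCalibratedLoss(nn.Module):
    """Generic per-class margin-calibrated softmax loss (illustrative sketch only)."""

    def __init__(self, margins: torch.Tensor):
        super().__init__()
        # margins: shape (K,), one non-negative margin per class,
        # typically larger for rarer classes.
        self.register_buffer("margins", margins)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (B, K, H, W) raw scores; targets: (B, H, W) class indices in [0, K).
        num_classes = logits.size(1)
        flat_logits = logits.permute(0, 2, 3, 1).reshape(-1, num_classes)
        flat_targets = targets.reshape(-1)
        # Subtract the class-dependent margin from the true-class logit, so a larger
        # margin must be overcome before a pixel of that class is scored correctly.
        adjusted = flat_logits.clone()
        idx = torch.arange(flat_logits.size(0), device=flat_logits.device)
        adjusted[idx, flat_targets] -= self.margins[flat_targets]
        return F.cross_entropy(adjusted, flat_targets)


# Hypothetical usage, with margins shrinking as a class gets more pixels,
# e.g. proportional to N_k^(-1/4) as in Cao et al. (2019):
# counts = per-class pixel counts over the training set, a (K,) float tensor
# margins = 0.5 * counts.pow(-0.25) / counts.pow(-0.25).max()
# criterion = MarginCalibratedLoss(margins)
# loss = criterion(model(images), labels)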


Notes

  1. See our PyTorch implementation at https://github.com/yutao1008/margin_calibration for more details.

  2. https://www.cityscapes-dataset.com/method-details/?submissionID=10089

References

  • Abraham, N., & Khan, N. M. (2019). A novel focal tversky loss function with improved attention u-net for lesion segmentation. In: ISBI, pp. 683–687.

  • Ahmed, F., Tarlow, D., & Batra, D. (2015). Optimizing expected intersection-over-union with candidate-constrained crfs. In: ICCV, pp. 1850–1858.

  • Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al. (2017). Robotic instrument segmentation challenge. CoRR.

  • Berman, M., Rannen Triki, A., Blaschko, M.B. (2018). The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR, pp. 4413–4421.

  • Blaschko, M. B., & Lampert, C. H. (2008). Learning to localize objects with structured output regression. In: ECCV, pp. 2–15.

  • Boser, B.E., Guyon, I.M., Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual workshop on Computational learning theory, pp. 144–152.

  • Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. In: ICRA, pp. 2639–2645.

  • Caesar, H., Uijlings, J., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In: CVPR, pp. 1209–1218.

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In: NIPS, pp. 1567–1578.

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.


  • Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV, pp. 801–818.

  • Cheng, B., Chen, L. C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T. S., Hwu, W. M., & Shi, H. (2019). Spgnet: Semantic prediction guidance for scene parsing. In: ICCV, pp. 5218–5228.

  • Choi, S., Kim, J. T., & Choo, J. (2020). Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In: CVPR, pp. 9373–9383.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223.

  • Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2018). Context contrasted feature and gated multi-scale aggregation for scene segmentation. In: CVPR, pp. 2393–2402.

  • Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2020). Semantic segmentation with context encoding and multi-path decoding. IEEE Transactions on Image Processing, 29, 3520–3533.


  • Eelbode, T., Bertels, J., Berman, M., Vandermeulen, D., Maes, F., Bisschops, R., & Blaschko, M. B. (2020). Optimization for medical image segmentation: Theory and practice when evaluating with dice score or jaccard index. IEEE Transactions on Medical Imaging, 39(11), 3679–3690.


  • Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.


  • Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In: CVPR, pp. 3146–3154.

  • Grabocka, J., Scholz, R., Schmidt-Thieme, L. (2019). Learning surrogate losses. CoRR.

  • Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., & Malik, J. (2011). Semantic contours from inverse detectors. In: ICCV, pp. 991–998.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR, pp. 770–778.

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In: CVPR, pp. 7132–7141.

  • Karimi, D., & Salcudean, S. E. (2019). Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on Medical Imaging, 39(2), 499–513.


  • Ke, T., Hwang, J., Liu, Z., & Yu, S. (2018). Adaptive affinity fields for semantic segmentation. In: ECCV, pp. 587–602.

  • Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., & Ayed, I. B. (2019). Boundary loss for highly unbalanced segmentation. In: MIDL, pp. 285–296.

  • Khan, S., Hayat, M., Zamir, S. W., Shen, J., & Shao, L. (2019). Striking the right balance with uncertainty. In: CVPR, pp. 103–112.

  • Li, Y., Zaragoza, H., Herbrich, R., Shawe-Taylor, J., & Kandola, J. (2002). The perceptron algorithm with uneven margins. In: ICML, pp. 379–386.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In: ICCV, pp. 2980–2988.

  • Liu, X., Wang, Y., Wang, L., et al. (2019). Mcdiarmid-type inequalities for graph-dependent variables and stability bounds. In: NIPS, pp. 10890–10901.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440.

  • Loshchilov, I., Hutter, F. (2019). Decoupled weight decay regularization. In: ICLR.

  • Ma, J., Chen, J., Ng, M., Huang, R., Li, Y., Li, C., Yang, X., & Martel, A. L. (2021). Loss odyssey in medical image segmentation. Medical Image Analysis, 71, 102035.


  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. London: MIT Press.


  • Nagendar, G., Singh, D., Balasubramanian, V.N., Jawahar, C. (2018). Neuro-iou: Learning a surrogate loss for semantic segmentation. In: BMVC, p. 278.

  • Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In: CVPR, pp. 4990–4999.

  • Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., Srebro, N. (2018). The role of over-parametrization in generalization of neural networks. In: ICLR.

  • Nowozin, S. (2014). Optimal decisions from probabilistic models: The intersection-over-union case. In: CVPR, pp. 548–555.

  • Rahman, M. A., & Wang, Y. (2016). Optimizing intersection-over-union in deep neural networks for image segmentation. In: International symposium on visual computing, pp. 234–244.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  • Salehi, S. S. M., Erdogmus, D., & Gholipour, A. (2017). Tversky loss function for image segmentation using 3d fully convolutional deep networks. In: International Workshop on Machine Learning in Medical Imaging, pp. 379–387.

  • Shen, D., Ji, Y., Li, P., Wang, Y., Lin, D. (2020). Ranet: Region attention network for semantic segmentation. In: NIPS.

  • Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., & Cardoso, M. J. (2017). Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248.

  • Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In: CVPR, pp. 843–852.

  • Wang, G., Liu, X., Li, C., Xu, Z., Ruan, J., Zhu, H., Meng, T., Li, K., Huang, N., & Zhang, S. (2020). A noise-robust framework for automatic segmentation of covid-19 pneumonia lesions from ct images. IEEE Transactions on Medical Imaging, 39(8), 2653–2663.


  • Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.


  • Wang, L., Li, D., Zhu, Y., Tian, L., & Shan, Y. (2020). Dual super-resolution learning for semantic segmentation. In: CVPR, pp. 3774–3783.

  • Wong, K. C., Moradi, M., Tang, H., & Syeda-Mahmood, T. (2018). 3d segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: MICCAI, pp. 612–619.

  • Xiao, J., & Quan, L. (2009). Multiple view semantic segmentation for street view images. In: ICCV, pp. 686–693.

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In: ECCV, pp. 418–434.

  • Xu, D., Ouyang, W., Wang, X., & Sebe, N. (2018). Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: CVPR, pp. 675–684.

  • Xuhong, L., Grandvalet, Y., & Davoine, F. (2018). Explicit inductive bias for transfer learning with convolutional networks. In: ICML, pp. 2825–2834.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: CVPR, pp. 2636–2645.

  • Yu, F., Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. CoRR.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: CVPR, pp. 2881–2890.

  • Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., & Jia, J. (2018). Psanet: Point-wise spatial attention network for scene parsing. In: ECCV, pp. 267–283.

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In: CVPR, pp. 633–641.

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3), 302–321.



Acknowledgements

This research work is supported by the Multimedia Data Analytics Lab of the Global Big Data Technologies Centre (GBDTC), University of Technology Sydney (UTS). The experimental environment is provided by UTS Tech Lab. Zhibin Li acknowledges the support from UTS to conduct this research and the continued support from CSIRO’s Machine Learning and Artificial Intelligence Future Science Platform.

Author information


Corresponding author

Correspondence to Jian Zhang.

Additional information

Communicated by Zhouchen Lin. Litao Yu and Zhibin Li contributed equally to this work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Appendix

A.1 Proof of Theorem 1

We first prove that, with probability \(1-\frac{\eta }{K}\), the following inequality holds:

(19)

Assume the following inequality holds for non-negative \(\epsilon _{k0}\) and \(\epsilon _{0k}\):

(20)

Solving the above inequality, we can get:

$$\begin{aligned} \epsilon _k = (\frac{a_k}{b_k}\epsilon _{0k}+\epsilon _{k0})(b_k-\epsilon _{0k})^{-1}, \end{aligned}$$
(21)

where and .

Next, we need to determine the values of \(\epsilon _{k0}\) and \(\epsilon _{0k}\) used to calculate \(\epsilon _k\); they should satisfy the following inequality:

(22)

so we can simply substitute (22) into (20) to complete the proof.

The empirical label distribution \(P_k\) does not depend on the model \(\theta \), and we assume it is an accurate estimate of the label distribution \(\mathcal {D}_{\mathcal {Y}}\), i.e.,

(23)

Based on the above approximation, a sufficient condition for (22) regarding \(\epsilon _{0k}\) and \(\epsilon _{k0}\) is:

(24)

Following the margin-based generalization bound in Mohri et al. (2018, Theorem 9.2), for the \(N_k\) foreground-class pixels, with probability at least \(1-\frac{\eta }{2K}\), we have:

(25)

where \(\mathfrak {R}_{N_k}(\Theta )\) is the Rademacher complexity of the hypothesis class \(\Theta \) over the \(N_k\) pixels of the k-th foreground class. For an input data batch with M pixels, we first apply McDiarmid’s inequality for M-dependent data (Liu et al. 2019) to the proof of Mohri et al. (2018, Theorem 3.3), and then use the result in the proof of Mohri et al. (2018, Theorem 9.2) to obtain the formulation of (25).

The Rademacher complexity \(\mathfrak {R}_{N_k}(\Theta )\) typically scales as \(\sqrt{\frac{C(\Theta )}{N_k}}\), with \(C(\Theta )\) being a proper complexity measure of \(\Theta \) (Neyshabur et al. 2018); such a scaling has also been used in related work (see Cao et al. 2019 and the references therein). We can then rewrite (25) as:

(26)

where \(\sigma (\frac{1}{\eta }) \triangleq \frac{\rho _{\max }}{4K} \sqrt{2M\log \frac{2K}{\eta }}\) is typically a low-order term in \(\frac{1}{\eta }\).

Similarly, for the \(N-N_k\) pixels of the background class, with probability at least \(1-\frac{\eta }{2K}\), we have:

(27)

We then combine (26), (27), and (24), and take a union bound over \(\epsilon _{k0}\) and \(\epsilon _{0k}\) to get the following expressions, with which (24) holds with probability at least \(1-\eta /K\):

$$\begin{aligned} \epsilon _{k0}&= \frac{\sqrt{N_k}}{N} \frac{4K}{\rho _{k0}}\mathcal {F},\nonumber \\ \epsilon _{0k}&= \frac{\sqrt{N-N_k}}{N}\frac{4K}{\rho _{0k}} \mathcal {F}. \end{aligned}$$
(28)

We then substitute the above expressions into (21). Letting \(\mu _k = \frac{\rho _{k0}}{\rho _{0k}}\), we have:

$$\begin{aligned} \epsilon _k = \frac{\frac{a_k}{b_k}\sqrt{N-N_k} + \frac{\sqrt{N_k}}{\mu _k}}{\frac{b_k N}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}, \end{aligned}$$
(29)

so that inequality (13) holds with probability at least \(1-\eta /K\).
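To make the substitution explicit, plugging (28) into (21) and multiplying the numerator and denominator by \(\frac{N\rho _{0k}}{4K\mathcal {F}}\) gives

$$\begin{aligned} \epsilon _k = \frac{\frac{a_k}{b_k}\frac{\sqrt{N-N_k}}{N}\frac{4K\mathcal {F}}{\rho _{0k}} + \frac{\sqrt{N_k}}{N}\frac{4K\mathcal {F}}{\rho _{k0}}}{b_k - \frac{\sqrt{N-N_k}}{N}\frac{4K\mathcal {F}}{\rho _{0k}}} = \frac{\frac{a_k}{b_k}\sqrt{N-N_k} + \frac{\rho _{0k}}{\rho _{k0}}\sqrt{N_k}}{\frac{b_k N}{4K\mathcal {F}}\rho _{0k} - \sqrt{N-N_k}}, \end{aligned}$$

which is exactly (29), since \(\frac{\rho _{0k}}{\rho _{k0}} = \frac{1}{\mu _k}\).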

In practice, we do not know the values of \(a_k\) and \(b_k\), so Eq. (29) has limited direct use. However, since \(\frac{a_k}{b_k}\le 1\), we can obtain a very useful bound:

$$\begin{aligned} \epsilon _k \le \frac{\sqrt{N-N_k} + \frac{\sqrt{N_k}}{\mu _k}}{\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}. \end{aligned}$$
(30)

Taking a union bound of Eq. (13) over all classes and averaging, we obtain the following inequality with probability at least \(1-\eta \):

(31)

with

$$\begin{aligned} \epsilon = \frac{1}{K}\sum \limits _{k=1}^{K} \frac{ \sqrt{N-N_k}+ \frac{\sqrt{N_k}}{\mu _k}}{\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k}}, \end{aligned}$$
(32)

which completes the proof.

A.2 Proof of Corollary 1

Substituting \(\mu _k\) in (32) with \(\frac{\sqrt{N_k}}{r(N/N_k\!-\!1)-\sqrt{N\!-\!N_k}}\), where \(r\) is a hyper-parameter, we get:

$$\begin{aligned} \epsilon&= \frac{1}{K}\sum _{k=1}^{K} (\frac{r(N-N_k)}{N_k})(\frac{N_k}{4K\mathcal {F}}\rho _{0k}-\sqrt{N-N_k})^{-1} \nonumber \\&= \frac{1}{K}\sum _{k=1}^{K} (\frac{r(N-N_k)}{N_k^2})(\frac{1}{4K\mathcal {F}}\rho _{0k}-\frac{\sqrt{N-N_k}}{N_k})^{-1}. \end{aligned}$$
(33)
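To spell out the cancellation behind the first equality: with this choice of \(\mu _k\),

$$\begin{aligned} \frac{\sqrt{N_k}}{\mu _k} = r\Big (\frac{N}{N_k}-1\Big ) - \sqrt{N-N_k} = \frac{r(N-N_k)}{N_k} - \sqrt{N-N_k}, \end{aligned}$$

so the numerator \(\sqrt{N-N_k}+\frac{\sqrt{N_k}}{\mu _k}\) of each summand in (32) reduces to \(\frac{r(N-N_k)}{N_k}\); the second equality then follows by dividing the numerator and denominator by \(N_k\).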

Let \(x_k = \frac{r(N-N_k)}{N_k^2} \) and \(y_k = \frac{1}{4K\mathcal {F}}\rho _{0k}-\frac{\sqrt{N-N_k}}{N_k}\). According to the Cauchy-Schwarz inequality, we have:

$$\begin{aligned} \left( \sum _{k=1}^{K} \sqrt{\frac{x_k}{y_k}}\cdot \sqrt{y_k}\right) ^2 \le (\sum _{k=1}^{K}\frac{x_k}{y_k}) (\sum _{k=1}^{K} y_k), \end{aligned}$$
(34)

so that

$$\begin{aligned} \epsilon&\ge \frac{1}{K}\cdot \left( \sum \limits _{k=1}^{K} \sqrt{x_k}\right) ^2 (\sum \limits _{k=1}^{K} y_k)^{-1} \nonumber \\&= \frac{r}{K}\cdot \frac{\left( \sum \limits _{k=1}^{K} \frac{\sqrt{N-N_k}}{N_k}\right) ^2}{\frac{1}{4K\mathcal {F}}\sum \limits _{k=1}^{K} \rho _{0k}- \sum \limits _{k=1}^{K}\frac{\sqrt{N-N_k}}{N_k}}. \end{aligned}$$
(35)
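The second line of (35) uses

$$\begin{aligned} \sqrt{x_k} = \sqrt{r}\,\frac{\sqrt{N-N_k}}{N_k} \quad \text {and}\quad \sum \limits _{k=1}^{K} y_k = \frac{1}{4K\mathcal {F}}\sum \limits _{k=1}^{K}\rho _{0k} - \sum \limits _{k=1}^{K}\frac{\sqrt{N-N_k}}{N_k}. \end{aligned}$$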

The right-hand side of (35) is a constant, because \(r\) is a given hyper-parameter and we assume \(\sum \limits _{k=1}^{K} \rho _{0k}\) is a constant. The equality holds when \(\frac{\sqrt{x_1}}{y_1}=\ldots =\frac{\sqrt{x_K}}{y_K}\), which yields Corollary 1.

Note that here \(\mu _k = \frac{\sqrt{N_k}}{r(N/N_k-1)-\sqrt{N-N_k}}\), while in Corollary 1, \(\mu _k = \frac{P_k\sqrt{N_k}}{\upsilon (N-N_k)-P_k\sqrt{N-N_k}}\). These two conditions are essentially equivalent when \(r\) and \(\upsilon \) are hyper-parameters: simply let \(r=N\upsilon \) and notice that \(P_k=\frac{N_k}{N}\), as the following calculation shows.
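Concretely, with \(P_k = \frac{N_k}{N}\) and \(r = N\upsilon \),

$$\begin{aligned} \frac{P_k\sqrt{N_k}}{\upsilon (N-N_k)-P_k\sqrt{N-N_k}} = \frac{\sqrt{N_k}}{\frac{N\upsilon (N-N_k)}{N_k}-\sqrt{N-N_k}} = \frac{\sqrt{N_k}}{r(N/N_k-1)-\sqrt{N-N_k}}, \end{aligned}$$

where the first equality multiplies the numerator and denominator by \(\frac{N}{N_k}\).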


Cite this article

Yu, L., Li, Z., Xu, M. et al. Distribution-Aware Margin Calibration for Semantic Segmentation in Images. Int J Comput Vis 130, 95–110 (2022). https://doi.org/10.1007/s11263-021-01533-0
