Real-time Semantic Segmentation with Fast Attention, arXiv - CS - Multimedia


arXiv - CS - Multimedia. Pub Date: 2020-07-07, DOI: arxiv-2007.03815
Ping Hu, Federico Perazzi, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Kate Saenko, Stan Sclaroff

In deep CNN-based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real time. The proposed architecture relies on our fast spatial attention, a simple yet efficient modification of the popular self-attention mechanism that captures the same rich spatial context at a small fraction of the computational cost by changing the order of operations. Moreover, to process high-resolution input efficiently, we apply an additional spatial reduction to intermediate feature stages of the network, with minimal loss in accuracy thanks to the fast attention module used to fuse features. We validate our method with a series of experiments on multiple datasets, which show better accuracy and speed than existing approaches for real-time semantic segmentation. On Cityscapes, our network achieves 74.4% mIoU at 72 FPS and 75.5% mIoU at 58 FPS on a single Titan X GPU, which is ~50% faster than the state of the art while retaining the same accuracy.
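The abstract says only that fast attention saves computation "by changing the order of operations". The sketch below illustrates one way such a reordering can work; it is not the authors' implementation. It assumes the softmax of standard self-attention is replaced by L2-normalizing the rows of Q and K and scaling by 1/n, which makes the product associative, so Q(KᵀV) equals (QKᵀ)V. All function names (`attention_affinity_first`, `attention_reordered`, `multiply_adds`) and the toy sizes are hypothetical, and plain Python lists stand in for feature maps flattened to n positions by c channels.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def l2_normalize_rows(a, eps=1e-12):
    """Scale each row to unit L2 norm (assumed stand-in for the softmax)."""
    out = []
    for row in a:
        norm = max(math.sqrt(sum(x * x for x in row)), eps)
        out.append([x / norm for x in row])
    return out


def attention_affinity_first(q, k, v):
    """Compute (Q K^T) V / n, which materializes an n x n affinity matrix."""
    n = len(q)
    affinity = matmul(q, transpose(k))  # n x n
    return [[x / n for x in row] for row in matmul(affinity, v)]


def attention_reordered(q, k, v):
    """Compute Q (K^T V) / n, which only needs a c x c intermediate."""
    n = len(q)
    context = matmul(transpose(k), v)  # c x c
    return [[x / n for x in row] for row in matmul(q, context)]


def multiply_adds(n, c):
    """Count multiply-adds for both orderings when Q, K, V are all n x c."""
    return {"affinity_first": 2 * n * n * c, "reordered": 2 * n * c * c}


# Toy check: both orderings agree up to floating-point rounding.
random.seed(0)
n, c = 6, 3
q = l2_normalize_rows([[random.gauss(0, 1) for _ in range(c)] for _ in range(n)])
k = l2_normalize_rows([[random.gauss(0, 1) for _ in range(c)] for _ in range(n)])
v = [[random.gauss(0, 1) for _ in range(c)] for _ in range(n)]

out_a = attention_affinity_first(q, k, v)
out_b = attention_reordered(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(out_a, out_b) for x, y in zip(ra, rb))
```

Under these assumptions the cost ratio between the two orderings is n/c, so the saving grows with the number of spatial positions; `multiply_adds(4096, 32)`, for example, compares a 64x64 feature map with 32 channels. With a softmax in place the reordering would not be valid, because the row-wise normalization depends on the full n x n affinity matrix.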


Updated: 2020-07-13
