Efficient pyramid context encoding and feature embedding for semantic segmentation
Image and Vision Computing (IF 4.2), Pub Date: 2021-05-08, DOI: 10.1016/j.imavis.2021.104195
Mengyu Liu, Hujun Yin

For real-world applications of semantic segmentation, inference speed and memory usage are two important factors. To address these challenges, we propose a lightweight feature pyramid encoding network (FPENet) for semantic segmentation with a good trade-off between accuracy and speed. We use a series of feature pyramid encoding (FPE) blocks to encode context at multiple scales in the encoder. Each FPE block consists of different depthwise dilated convolutions that act as a spatial pyramid to extract features while reducing computational cost. During training, a one-shot neural architecture search algorithm is adopted to find the optimal structure for each FPE block from a large search space at a small search cost. After the encoder search, a mutual embedding upsample module, consisting of two attention blocks, is introduced in the decoder. Its encoder-decoder attention mechanism helps efficiently aggregate high-level semantic features and low-level spatial details. The proposed network outperforms existing real-time methods with fewer parameters and higher inference speed on the Cityscapes and CamVid benchmark datasets. Specifically, it achieves 72.3% mean IoU on the Cityscapes test set with only 0.4 M parameters at 192.6 FPS on an Nvidia Titan V100 GPU, and 73.4% mean IoU at 116.2 FPS when running on higher-resolution images.
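The FPE block described above lends itself to a compact implementation. Below is a minimal PyTorch-style sketch of such a block, assuming parallel depthwise 3x3 convolutions with increasing dilation rates followed by a pointwise fusion convolution and a residual connection; the specific dilation rates, channel handling, and normalization used in FPENet may differ from this illustration.

```python
import torch
import torch.nn as nn

class FPEBlockSketch(nn.Module):
    """Illustrative sketch of a feature pyramid encoding (FPE) style block.

    Parallel depthwise 3x3 convolutions with increasing dilation rates act as
    a spatial pyramid; a 1x1 convolution fuses the multi-scale features.
    The dilation rates and fusion scheme here are assumptions for
    illustration, not the exact FPENet configuration.
    """

    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # depthwise dilated 3x3 convolution (groups == channels)
                nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                          dilation=d, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # pointwise convolution fuses the concatenated pyramid features
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(dilations), channels, kernel_size=1,
                      bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        pyramid = [branch(x) for branch in self.branches]
        out = self.fuse(torch.cat(pyramid, dim=1))
        return out + x  # residual connection


if __name__ == "__main__":
    block = FPEBlockSketch(channels=64)
    y = block(torch.randn(1, 64, 128, 256))
    print(y.shape)  # torch.Size([1, 64, 128, 256])
```

Because each branch is depthwise (groups equal to the channel count), its cost grows roughly linearly with the number of channels rather than quadratically, which is what keeps such a multi-scale pyramid lightweight enough for real-time use.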



Updated: 2021-05-17