00037-Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Author: Liang-Chieh Chen et al., Google Inc.
DeepLabv3+: extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries.
Atrous Spatial Pyramid Pooling:
In this work, we consider two types of neural networks that use a spatial pyramid pooling module [18,19,20] or an encoder-decoder structure [21,22] for semantic segmentation, where the former captures rich contextual information by pooling features at different resolutions while the latter is able to obtain sharp object boundaries.
- We propose a novel encoder-decoder structure which employs DeepLabv3 as a powerful encoder module and a simple yet effective decoder module.
- In our structure, one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.
- We adapt the Xception model for the segmentation task and apply depthwise separable convolution to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.
- Our proposed model attains a new state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes datasets. We also provide a detailed analysis of design choices and model variants.
- We make our TensorFlow-based implementation of the proposed model publicly available at https://github.com/tensorflow/models/tree/master/research/deeplab.
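A small sketch (not from the paper's code) of the arithmetic behind the second bullet: atrous convolution enlarges a filter's field of view without extra parameters, and the "output stride" is the ratio of input resolution to encoder feature resolution. The function names here are illustrative assumptions, not API from the DeepLab implementation.

```python
# Illustrative sketch: atrous (dilated) convolution and output stride arithmetic.
# Effective kernel size of a k x k filter with atrous rate r: k + (k - 1) * (r - 1).

def effective_kernel_size(k: int, rate: int) -> int:
    """Spatial extent covered by a k x k filter dilated with the given atrous rate."""
    return k + (k - 1) * (rate - 1)

def output_stride(input_size: int, feature_size: int) -> int:
    """Ratio of input spatial resolution to encoder output resolution."""
    return input_size // feature_size

# A 3x3 filter with rate 2 sees a 5x5 neighborhood at no extra parameter cost.
print(effective_kernel_size(3, 2))   # 5
print(effective_kernel_size(3, 18))  # 37

# The encoder can extract features at output stride 16 (faster) or 8 (denser,
# more accurate) simply by changing where atrous convolution is applied.
print(output_stride(512, 32))  # 16
print(output_stride(512, 64))  # 8
```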
Use DeepLabv3 as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.
3.1 Encoder-Decoder with Atrous Convolution
Depthwise separable convolution: factorizes a standard convolution into a depthwise convolution (a spatial filter applied per channel) followed by a pointwise (1 × 1) convolution that mixes channels, which drastically reduces computation complexity.
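To make "drastically reduces" concrete, here is a back-of-the-envelope multiply count for a standard convolution versus its depthwise separable factorization. This is illustrative arithmetic under assumed sizes, not the paper's implementation.

```python
# Sketch: multiplication counts for standard vs. depthwise separable convolution.

def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiplications for a k x k standard convolution over an h x w feature map."""
    return k * k * c_in * c_out * h * w

def separable_conv_cost(k, c_in, c_out, h, w):
    """Depthwise (k x k per input channel) + pointwise (1 x 1) convolution."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# Example (assumed sizes): 3x3 conv, 256 -> 256 channels, 33x33 feature map.
std = standard_conv_cost(3, 256, 256, 33, 33)
sep = separable_conv_cost(3, 256, 256, 33, 33)
print(f"reduction factor: {std / sep:.1f}x")  # roughly 8.7x, i.e. ~1 / (1/c_out + 1/k^2)
```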
We use the last feature map before logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure.
We apply another 1 × 1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512) which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the reduced low-level features. After the concatenation, we apply a few 3 × 3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4.
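The decoder data flow above can be traced at the level of tensor shapes. The sketch below assumes a 513 × 513 input, encoder output stride 16, low-level features at output stride 4, and 48 channels after the 1 × 1 channel reduction (the value the paper settles on); the function and its defaults are illustrative, not the released code.

```python
# Shape-level sketch of the DeepLabv3+ decoder (sizes only, no real tensors).

def decoder_shapes(input_size=513, encoder_os=16, low_level_os=4,
                   encoder_ch=256, low_level_reduced_ch=48, decoder_ch=256):
    enc_hw = (input_size - 1) // encoder_os + 1      # encoder feature map side (33)
    low_hw = (input_size - 1) // low_level_os + 1    # low-level feature map side (129)
    # 1) bilinearly upsample encoder features by 4 so they match the low-level size
    up_hw = low_hw
    # 2) 1x1 conv reduces low-level channels to 48, then concatenate along channels
    concat_ch = encoder_ch + low_level_reduced_ch    # 256 + 48 = 304
    # 3) a few 3x3 convs refine to decoder_ch, then bilinear upsample by 4 again
    out_hw = input_size
    return {"encoder": (enc_hw, enc_hw, encoder_ch),
            "concat": (up_hw, up_hw, concat_ch),
            "output": (out_hw, out_hw, decoder_ch)}

print(decoder_shapes())
```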
We define "DeepLabv3 feature map" as the last feature map computed by DeepLabv3 (i.e., the features containing ASPP features and image-level features), and [k × k, f] as a convolution operation with kernel size k × k and f filters.
In the decoder module, we consider three places for different design choices, namely (1) the 1 × 1 convolution used to reduce the channels of the low-level feature map from the encoder module, (2) the 3 × 3 convolution used to obtain sharper segmentation results, and (3) which encoder low-level features should be used.
We do not pursue an even denser output feature map (i.e., output stride < 4) given the limited GPU resources.
4.2 ResNet-101 as Network Backbone
4.3 Xception as Network Backbone
4.4 Improvement along Object Boundaries