00037-Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Author: Liang-Chieh Chen et al., Google Inc.
DeepLabv3+: extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries.
Atrous Spatial Pyramid Pooling:
In this work, we consider two types of neural networks that use a spatial pyramid pooling module [18,19,20] or an encoder-decoder structure [21,22] for semantic segmentation, where the former captures rich contextual information by pooling features at different resolutions while the latter is able to obtain sharp object boundaries.
- We propose a novel encoder-decoder structure which employs DeepLabv3 as a powerful encoder module and a simple yet effective decoder module.
- In our structure, one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.
- We adapt the Xception model for the segmentation task and apply depthwise separable convolution to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.
- Our proposed model attains a new state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes datasets. We also provide a detailed analysis of design choices and model variants.
- We make our TensorFlow-based implementation of the proposed model publicly available at https://github.com/tensorflow/models/tree/master/research/deeplab.
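A small sketch (not from the paper's code) of the arithmetic behind the second bullet: atrous convolution enlarges a filter's field of view without extra parameters, and the "output stride" is the ratio of input resolution to encoder feature resolution. The function names here are illustrative assumptions, not API from the DeepLab implementation.

```python
# Illustrative sketch: atrous (dilated) convolution and output stride arithmetic.
# Effective kernel size of a k x k filter with atrous rate r: k + (k - 1) * (r - 1).

def effective_kernel_size(k: int, rate: int) -> int:
    """Spatial extent covered by a k x k filter dilated with the given atrous rate."""
    return k + (k - 1) * (rate - 1)

def output_stride(input_size: int, feature_size: int) -> int:
    """Ratio of input spatial resolution to encoder output resolution."""
    return input_size // feature_size

# A 3x3 filter with rate 2 sees a 5x5 neighborhood at no extra parameter cost.
print(effective_kernel_size(3, 2))   # 5
print(effective_kernel_size(3, 18))  # 37

# The encoder can extract features at output stride 16 (faster) or 8 (denser,
# more accurate) simply by changing where atrous convolution is applied.
print(output_stride(512, 32))  # 16
print(output_stride(512, 64))  # 8
```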
Use DeepLabv3 as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.
3.1 Encoder-Decoder with Atrous Convolution
Depthwise separable convolution: factorizes a standard convolution into a depthwise convolution (a spatial filter applied per channel) followed by a pointwise (1 × 1) convolution that mixes channels, which drastically reduces computation complexity.
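To make "drastically reduces" concrete, here is a back-of-the-envelope multiply count for a standard convolution versus its depthwise separable factorization. This is illustrative arithmetic under assumed sizes, not the paper's implementation.

```python
# Sketch: multiplication counts for standard vs. depthwise separable convolution.

def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiplications for a k x k standard convolution over an h x w feature map."""
    return k * k * c_in * c_out * h * w

def separable_conv_cost(k, c_in, c_out, h, w):
    """Depthwise (k x k per input channel) + pointwise (1 x 1) convolution."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# Example (assumed sizes): 3x3 conv, 256 -> 256 channels, 33x33 feature map.
std = standard_conv_cost(3, 256, 256, 33, 33)
sep = separable_conv_cost(3, 256, 256, 33, 33)
print(f"reduction factor: {std / sep:.1f}x")  # roughly 8.7x, i.e. ~1 / (1/c_out + 1/k^2)
```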
We use the last feature map before logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure.
We apply another 1 × 1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512) which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the reduced low-level features. After the concatenation, we apply a few 3 × 3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4.
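The decoder data flow above can be traced at the level of tensor shapes. The sketch below assumes a 513 × 513 input, encoder output stride 16, low-level features at output stride 4, and 48 channels after the 1 × 1 channel reduction (the value the paper settles on); the function and its defaults are illustrative, not the released code.

```python
# Shape-level sketch of the DeepLabv3+ decoder (sizes only, no real tensors).

def decoder_shapes(input_size=513, encoder_os=16, low_level_os=4,
                   encoder_ch=256, low_level_reduced_ch=48, decoder_ch=256):
    enc_hw = (input_size - 1) // encoder_os + 1      # encoder feature map side (33)
    low_hw = (input_size - 1) // low_level_os + 1    # low-level feature map side (129)
    # 1) bilinearly upsample encoder features by 4 so they match the low-level size
    up_hw = low_hw
    # 2) 1x1 conv reduces low-level channels to 48, then concatenate along channels
    concat_ch = encoder_ch + low_level_reduced_ch    # 256 + 48 = 304
    # 3) a few 3x3 convs refine to decoder_ch, then bilinear upsample by 4 again
    out_hw = input_size
    return {"encoder": (enc_hw, enc_hw, encoder_ch),
            "concat": (up_hw, up_hw, concat_ch),
            "output": (out_hw, out_hw, decoder_ch)}

print(decoder_shapes())
```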
We define "DeepLabv3 feature map" as the last feature map computed by DeepLabv3 (i.e., the features containing ASPP features and image-level features), and [k × k, f] as a convolution operation with kernel size k × k and f filters.
In the decoder module, we consider three places for different design choices, namely (1) the 1 × 1 convolution used to reduce the channels of the low-level feature map from the encoder module, (2) the 3 × 3 convolution used to obtain sharper segmentation results, and (3) which encoder low-level features should be used.
We do not pursue an even denser output feature map (i.e., output stride < 4) given the limited GPU resources.
4.2 ResNet-101 as Network Backbone
4.3 Xception as Network Backbone
4.4 Improvement along Object Boundaries