00036 - Xception: Deep Learning with Depthwise Separable Convolutions
Author: Francois Chollet (Google), author of Keras, Google Brain
Keywords: Depthwise Separable Convolutions
1. Introduction
A single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations.
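A minimal sketch of this factorization, assuming TensorFlow 2.x / tf.keras (the layer names and shapes below are illustrative, not taken from the paper): a regular convolution maps spatial and cross-channel correlations with a single kernel, while a depthwise separable convolution splits the work into a per-channel spatial step and a 1x1 cross-channel step.

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 64))  # NHWC input with 64 channels

# Regular convolution: one set of 3x3x64 kernels handles spatial and
# cross-channel correlations jointly (~74k parameters for 128 output channels).
regular = tf.keras.layers.Conv2D(128, kernel_size=3, padding="same")

# Depthwise separable factorization: per-channel 3x3 spatial filtering,
# followed by a 1x1 convolution that mixes channels (~9k parameters in total).
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")
pointwise = tf.keras.layers.Conv2D(128, kernel_size=1)

y_regular = regular(x)                 # (1, 32, 32, 128)
y_separable = pointwise(depthwise(x))  # (1, 32, 32, 128), far fewer parameters
```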
1.2 The continuum between convolutions and separable convolutions
Two minor differences between an "extreme" version of an Inception module and a depthwise separable convolution would be (contrasted in the sketch after this list):
- The order of the operations: depthwise separable convolutions as usually implemented (e.g. in TensorFlow) perform first channel-wise spatial convolution and then perform 1x1 convolution, whereas Inception performs the 1x1 convolution first.
- The presence or absence of a non-linearity after the first operation. In Inception, both operations are followed by a ReLU non-linearity, whereas depthwise separable convolutions are usually implemented without non-linearities.
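A sketch of the two orderings discussed above, assuming tf.keras; the function names and filter arguments are illustrative, not from the paper.

```python
import tensorflow as tf

def depthwise_separable(x, filters):
    """Depthwise separable conv as commonly implemented (e.g. in TensorFlow):
    channel-wise spatial convolution first, then 1x1, with no non-linearity
    in between."""
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)
    x = tf.keras.layers.Conv2D(filters, kernel_size=1)(x)
    return x

def extreme_inception(x, filters):
    """'Extreme' Inception ordering: 1x1 convolution first, with a ReLU after
    each operation, then per-channel 3x3 spatial convolution."""
    x = tf.keras.layers.Conv2D(filters, kernel_size=1, activation="relu")(x)
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same",
                                        activation="relu")(x)
    return x
```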
2. Prior work
- CNN, in particular the VGG-16
- Inception architecture
- Depthwise separable convolutions
- Residual connections
3. The Xception architecture
We make the following hypothesis: the mapping of cross-channel correlations and spatial correlations in the feature maps of convolutional neural networks can be entirely decoupled.
In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections.
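A sketch of one such residual block built from depthwise separable convolutions, assuming tf.keras; the filter count (728) and the three-repeat structure loosely follow the paper's middle flow, and the block assumes the input already has `filters` channels so the identity shortcut can be added directly.

```python
import tensorflow as tf

def xception_block(x, filters=728):
    """A residual block of depthwise separable convolutions,
    in the spirit of Xception's middle flow (illustrative sketch)."""
    shortcut = x
    for _ in range(3):
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.SeparableConv2D(filters, kernel_size=3,
                                            padding="same", use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
    # Residual connection: add the block input back to its output.
    return tf.keras.layers.Add()([x, shortcut])
```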
4. Experimental evaluation
JFT is an internal Google dataset for large-scale image classification, first introduced by Hinton et al. in [5], which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes. To evaluate the performance of a model trained on JFT, we use an auxiliary dataset, FastEval14k.
4.2 Optimization configuration
4.3 Regularization configuration
4.5 Comparison with Inception V3
4.5.1 Classification performance
4.6 Effect of the residual connections
4.7 Effect of an intermediate activation after pointwise convolutions