cs231n Lecture 9: Common CNN Architectures

Recap of the previous lecture: deep learning frameworks (TensorFlow, PyTorch, Caffe)


LeNet-5

LeCun et al., 1998

Architecture: CONV-POOL-CONV-POOL-FC-FC 

Conv filters: 5x5, stride 1

Subsampling (pooling) layers: 2x2, stride 2

 

AlexNet

Krizhevsky et al., 2012, ImageNet Challenge

Architecture: CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8

 

Assume the input is 227x227x3.

CONV1: 96 11x11 filters, stride 4, pad 0

Q: output size?

(227 - 11)/4 + 1 = 55, so the output is 55x55x96

Q: number of parameters?

(11*11*3)*96 ≈ 35K

 

POOL1: 3x3 filters, stride 2

Q: output size?

(55 - 3)/2 + 1 = 27, so the output is 27x27x96

Q: number of parameters?

None: pooling layers have no parameters.
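As a quick sanity check, the same arithmetic in a few lines of plain Python (the layer sizes are the ones used above):

def conv_output_size(input_size, filter_size, stride, pad):
    # standard conv/pool output formula: (W - F + 2P) / S + 1
    return (input_size - filter_size + 2 * pad) // stride + 1

# CONV1: 96 filters of size 11x11x3, stride 4, pad 0, on a 227x227x3 input
print(conv_output_size(227, 11, 4, 0))  # 55  -> output volume is 55x55x96
print(11 * 11 * 3 * 96)                 # 34848 weights (~35K), plus 96 biases

# POOL1: 3x3 filters, stride 2 -- pooling layers have no parameters
print(conv_output_size(55, 3, 2, 0))    # 27  -> output volume is 27x27x96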


Full (simplified) AlexNet architecture:

[227x227x3] INPUT

[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0

[27x27x96] MAXPOOL1: 3x3 filters at stride 2

[27x27x96] NORM1: Normalization layer

[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2

[13x13x256] MAXPOOL2: 3x3 filters at stride 2

[13x13x256] NORM2: Normalization layer

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1

[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1

[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1

[6x6x256] MAXPOOL3: 3x3 filters at stride 2

[4096] FC6: 4096 neurons

[4096] FC7: 4096 neurons

[1000] FC8: 1000 neurons (class scores)
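To see these shapes fall out of a framework, here is a rough PyTorch sketch of the simplified single-stream architecture listed above. It is only a shape-checking sketch, not a faithful reproduction of the original two-GPU model; the NORM layers are assumed to be local response normalization (nn.LocalResponseNorm).

import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),              # CONV1
    nn.MaxPool2d(kernel_size=3, stride=2),                              # MAXPOOL1
    nn.LocalResponseNorm(size=5),                                       # NORM1
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),  # CONV2
    nn.MaxPool2d(kernel_size=3, stride=2),                              # MAXPOOL2
    nn.LocalResponseNorm(size=5),                                       # NORM2
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(), # CONV3
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(), # CONV4
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # CONV5
    nn.MaxPool2d(kernel_size=3, stride=2),                              # MAXPOOL3
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5),           # FC6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),                  # FC7
    nn.Linear(4096, 1000),                                              # FC8 (class scores)
)

x = torch.randn(1, 3, 227, 227)
print(alexnet(x).shape)  # torch.Size([1, 1000])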


Details/Retrospectives:

- first use of ReLU

- used Norm layers (not common anymore)

- heavy data augmentation

- dropout 0.5

- batch size 128

- SGD Momentum 0.9

- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus

- L2 weight decay 5e-4

- 7 CNN ensemble: 18.2% -> 15.4%

 

ZFNet

Zeiler and Fergus, 2013, ImageNet Challenge

Minor tweaks on top of AlexNet.

AlexNet but:

CONV1: change from (11x11, stride 4) to (7x7, stride 2)

CONV3/4/5: instead of 384, 384, 256 filters, use 512, 1024, 512

This reduced the top-5 error from 16.4% to 11.7%.

 

VGGNet

Simonyan and Zisserman, 2014, ImageNet Challenge

Key idea: small filters, deeper networks

8 layers (AlexNet) -> 16-19 layers (VGG16/VGG19)

Only 3x3 CONV with stride 1, pad 1, and 2x2 MAX POOL with stride 2

Top-5 error dropped from 11.7% to 7.3%.

Q: Why use smaller filters?

A stack of three 3x3 conv layers (stride 1) has the same effective receptive field as a single 7x7 conv layer.

The first layer sees a 3x3 region; the second layer is also 3x3, but because its inputs already overlap, its effective receptive field on the original input is 5x5; the third layer's effective receptive field is then 7x7.

The benefit: the network is deeper with more non-linearities, yet uses fewer parameters.

Assuming each layer has C channels, the parameter count drops from 7^2 * C^2 = 49C^2 (one 7x7 layer) to 3 * (3^2 * C^2) = 27C^2 (three 3x3 layers).
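A concrete instance of that count in plain Python, assuming C = 256 channels per layer and ignoring biases:

C = 256
params_one_7x7 = 7 * 7 * C * C           # 49 * C^2
params_three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2
print(params_one_7x7, params_three_3x3)  # 3211264 1769472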


To get a feel for the number of parameters in such a network, take VGG16 as an example:


The conv layers and the FC layers account for a huge number of parameters.
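A rough back-of-the-envelope count in plain Python for the standard VGG16 configuration (13 conv layers + 3 FC layers), ignoring biases:

# (kernel, C_in, C_out) for the 13 conv layers of VGG16
convs = [(3, 3, 64), (3, 64, 64),
         (3, 64, 128), (3, 128, 128),
         (3, 128, 256), (3, 256, 256), (3, 256, 256),
         (3, 256, 512), (3, 512, 512), (3, 512, 512),
         (3, 512, 512), (3, 512, 512), (3, 512, 512)]
# (inputs, outputs) for the 3 FC layers; 7*7*512 is the flattened conv output
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(k * k * cin * cout for k, cin, cout in convs)
fc_params = sum(nin * nout for nin, nout in fcs)
print(f"conv: {conv_params / 1e6:.1f}M, fc: {fc_params / 1e6:.1f}M")  # conv: 14.7M, fc: 123.6M

Of the roughly 138M parameters in VGG16, the vast majority sit in the three FC layers.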

Details:

- ILSVRC14: 2nd in classification, 1st in localization (localization means predicting a bounding box around the object)

- Similar training procedure as Krizhevsky 2012

- No Local Response Normalisation (LRN)

- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)

- Use ensembles for best results

- FC7 features generalize well to other tasks


GoogLeNet

Szegedy et al., 2014, ImageNet Challenge

- 22 layers

- Efficient Inception module

- No FC layers

- Only 5 million parameters! 12x less than AlexNet

- ILSVRC14 classification winner (6.7% top 5 error)


Inception module: design a good local network topology ("network within a network") and then stack these modules on top of each other.


The naive version, however, blows up the amount of computation. Suppose the input is 28x28x256, with stride 1 and padding chosen so the spatial size is preserved:


How do we reduce the computation? Use 1x1 convolutions as "bottleneck" layers: they preserve the spatial dimensions while reducing the depth.


Does this lose information?

To some degree, yes, but this way of combining feature maps also adds extra non-linearity.
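To make the savings concrete, here is a small plain-Python sketch that counts multiply-adds for the naive module versus the version with 1x1 reductions, on a 28x28x256 input. The branch widths (128 for the 1x1 branch, 192 for 3x3, 96 for 5x5) and the 64-channel reductions are illustrative assumptions in the spirit of the lecture example.

def conv_ops(h, w, c_in, c_out, k):
    # each of the h*w*c_out output values needs k*k*c_in multiply-adds
    return h * w * c_out * k * k * c_in

H, W, C = 28, 28, 256

naive = (conv_ops(H, W, C, 128, 1) +          # 1x1 branch
         conv_ops(H, W, C, 192, 3) +          # 3x3 branch
         conv_ops(H, W, C, 96, 5))            # 5x5 branch

reduced = (conv_ops(H, W, C, 128, 1) +                              # 1x1 branch
           conv_ops(H, W, C, 64, 1) + conv_ops(H, W, 64, 192, 3) +  # 1x1 reduce -> 3x3
           conv_ops(H, W, C, 64, 1) + conv_ops(H, W, 64, 96, 5) +   # 1x1 reduce -> 5x5
           conv_ops(H, W, C, 64, 1))                                # pool -> 1x1 projection

print(f"naive: {naive / 1e6:.0f}M ops, with 1x1 reductions: {reduced / 1e6:.0f}M ops")
# roughly 854M vs. 271M under these assumed widths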



Worth noting: the auxiliary losses (auxiliary classifiers attached to intermediate layers) are there to keep the gradient from vanishing during backpropagation.

ResNet

Kaiming He, Jian Sun, et al., 2015, ImageNet Challenge

- 152-layer model for ImageNet

- ILSVRC15 classification winner (3.57% top 5 error)

- Swept all classification and detection competitions in ILSVRC15 and COCO15!

What happens if we just keep making the network deeper? In theory it should perform at least as well, but:


The deeper network actually performs worse, and since its training error is also higher, the degradation is not caused by overfitting!
 

Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.

Solution: instead of directly fitting the desired mapping H(x), use the stacked layers to fit a residual mapping F(x) = H(x) - x and add the input back through a shortcut connection; F(x) is the residual. (Why? If the identity mapping is already close to optimal, it is much easier for the optimizer to push the residual F(x) toward zero than to learn an identity mapping through a stack of non-linear layers.)


- Stack residual blocks

- Every residual block has two 3x3 conv layers (a minimal sketch of such a block follows this list)

- Periodically, double the number of filters and downsample spatially using stride 2 (halving each spatial dimension)

- Additional conv layer at the beginning

- No FC layers at the end (only an FC-1000 layer to output the class scores)
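A minimal PyTorch sketch of such a basic residual block (two 3x3 convs with BatchNorm and an identity shortcut); the downsampling/filter-doubling variant is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # F(x) + x: the layers only have to learn the residual

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])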


For deeper networks (ResNet-50+), use a bottleneck layer to improve efficiency (similar to GoogLeNet).
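A corresponding sketch of the bottleneck variant: a 1x1 conv reduces the depth (here 256 -> 64, an illustrative choice), the 3x3 conv runs at the reduced depth, and a final 1x1 conv restores the depth before the shortcut addition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    def __init__(self, channels, reduced):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1, bias=False)   # 1x1, shrink depth
        self.conv3x3 = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(reduced, channels, 1, bias=False)   # 1x1, restore depth
        self.bn1 = nn.BatchNorm2d(reduced)
        self.bn2 = nn.BatchNorm2d(reduced)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)

print(BottleneckBlock(256, 64)(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])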


Training ResNet in practice:

- Batch Normalization after every CONV layer

- Xavier/2 initialization from He et al.

- SGD + Momentum (0.9)

- Learning rate: 0.1, divided by 10 when validation error plateaus

- Mini-batch size 256

- Weight decay of 1e-5

- No dropout used
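In PyTorch terms the recipe above looks roughly like the following sketch (a stand-in torchvision ResNet is used as the model, and ReduceLROnPlateau is one reasonable way to implement "divide by 10 when validation error plateaus"):

import torch
import torchvision

model = torchvision.models.resnet18()  # any ResNet-style model would do here

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # initial learning rate
                            momentum=0.9,       # SGD + momentum
                            weight_decay=1e-5)  # weight decay as listed above

# divide the learning rate by 10 whenever the monitored validation metric plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# inside the training loop (sketch):
#   loss = criterion(model(images), labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
# once per epoch, after evaluating on the validation set:
#   scheduler.step(val_error)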

1. Showed that very deep networks can be trained without degradation

2. Achieved the expectation that a very deep network can reach low training error

3. ILSVRC 2015 classification winner (3.6% top 5 error), better than human performance!


Comparison of architectures: the x-axis is the amount of computation, the y-axis is accuracy.


Some other architectures:

Network in Network (NiN)

Min Lin et al., 2014

1. Proposed the micronetwork (a small MLP applied at each conv location)

2. Proposed the precursor of bottleneck layers

Both ideas inspired GoogLeNet.


Identity Mappings in Deep Residual Networks

Kaiming He et al., 2016

1. Moves the activations into the residual mapping pathway ("pre-activation" residual blocks)


Wide Residual Networks

Zagoruyko et al., 2016

1. Argues that the residuals are the important factor, not the depth

2. Uses wider residual blocks (F x k filters instead of F filters in each layer)

3. Increasing width instead of depth is more computationally efficient (parallelizable)

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

Xie et al., 2016 (from the creators of ResNet)

1. Increases the width of the residual block through multiple parallel pathways ("cardinality")
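In modern frameworks the parallel pathways are usually implemented as a grouped convolution; here is a minimal sketch of a ResNeXt-style block (the widths and cardinality are illustrative, and BatchNorm is omitted for brevity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    def __init__(self, channels, cardinality=32, bottleneck_width=4):
        super().__init__()
        inner = cardinality * bottleneck_width        # total width of the parallel pathways
        self.reduce = nn.Conv2d(channels, inner, 1, bias=False)
        self.grouped = nn.Conv2d(inner, inner, 3, padding=1,
                                 groups=cardinality,  # one group per pathway
                                 bias=False)
        self.expand = nn.Conv2d(inner, channels, 1, bias=False)

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.grouped(out))
        return F.relu(self.expand(out) + x)

print(ResNeXtBlock(256)(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])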


Deep Networks with Stochastic Depth

Huang et al. 2016

- Motivation: reduce vanishing gradients and training time by using short networks during training

- Randomly drop a subset of layersduring each training pass

- Bypass with identity function

- Use full deep network at test time
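A minimal sketch of the idea, wrapping an arbitrary residual branch F(x): during training the branch is randomly dropped and the identity passes through; at test time the full network is used, with the branch output scaled by its survival probability (the wrapper class and its parameters are illustrative).

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, residual_branch, survival_prob=0.8):
        super().__init__()
        self.branch = residual_branch      # any module computing F(x) with matching shape
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() > self.survival_prob:
                return x                                # drop the block: identity bypass
            return x + self.branch(x)
        return x + self.survival_prob * self.branch(x)  # expected contribution at test time

# example: wrap a tiny conv branch
block = StochasticDepthBlock(nn.Conv2d(64, 64, 3, padding=1))
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])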



FractalNet: Ultra-Deep Neural Networks without Residuals

Larsson et al. 2017

- Argues that the key is transitioning effectively from shallow to deep, and that residual representations are not necessary

- Fractal architecture with both shallow and deep paths to the output

- Trained with dropping out sub-paths

- Full network at test time


Densely Connected Convolutional Networks

Huang et al. 2017

- Dense blocks where each layer is connected to every other layer in a feed-forward fashion

- Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
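A minimal sketch of a dense block, in which each layer receives the concatenation of all earlier feature maps and contributes growth_rate new channels (widths are illustrative; the published DenseNet additionally uses BN-ReLU-Conv ordering and 1x1 bottlenecks):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        # layer i sees the input plus everything produced by layers 0..i-1
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(F.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

print(DenseBlock(64, growth_rate=32, num_layers=4)(torch.randn(1, 64, 32, 32)).shape)
# torch.Size([1, 192, 32, 32])  -> 64 input channels + 4 * 32 new ones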


SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size

Iandola et al. 2017

1. Fire modules consisting of a "squeeze" layer with 1x1 filters feeding an "expand" layer with 1x1 and 3x3 filters
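A minimal sketch of such a Fire module (the channel counts roughly follow SqueezeNet's fire2 module, but are otherwise illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Fire(nn.Module):
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze, 1)             # 1x1 squeeze
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, 1)             # 1x1 expand
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, 3, padding=1)  # 3x3 expand

    def forward(self, x):
        s = F.relu(self.squeeze(x))
        return torch.cat([F.relu(self.expand1x1(s)),
                          F.relu(self.expand3x3(s))], dim=1)

print(Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55)).shape)  # torch.Size([1, 128, 55, 55])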