cs231n_lecture_9_Common CNN Architectures
Last lecture recap: deep learning frameworks (TensorFlow, PyTorch, Caffe)
LeNet-5
LeCun,1998
Architecture: CONV-POOL-CONV-POOL-FC-FC
Conv filters: 5x5, stride=1
Subsampling (pooling) layers: 2x2, stride=2
AlexNet
Krizhevsky, 2012, ImageNet Challenge
Architecture: CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8
Assume input: 227x227x3
CONV1: 96 11x11 filters, stride=4, pad=0
Q: output = ?
(227-11)/4 + 1 = 55, so the output is 55x55x96
Q: parameters = ?
11*11*3*96 ≈ 35K
POOL1: 3x3 filters, stride=2
Q: output = ?
(55-3)/2 + 1 = 27, so the output is 27x27x96
Q: parameters = ?
None: pooling layers have no parameters.
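To make these formulas reusable, here is a tiny pure-Python sketch (the function names are mine, not from the course code):

```python
def conv_output_size(input_size, filter_size, stride, pad):
    # (W - F + 2P) / S + 1, applied per spatial dimension
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_params(filter_size, in_depth, num_filters, bias=False):
    # Each filter has filter_size * filter_size * in_depth weights.
    n = filter_size * filter_size * in_depth * num_filters
    return n + (num_filters if bias else 0)

# CONV1 of AlexNet: 96 11x11 filters, stride 4, pad 0, on a 227x227x3 input
print(conv_output_size(227, 11, 4, 0))   # 55 -> 55x55x96 output volume
print(conv_params(11, 3, 96))            # 34848 (~35K) parameters

# POOL1: 3x3 filters, stride 2 -> 27x27x96, and no parameters at all
print(conv_output_size(55, 3, 2, 0))     # 27
```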
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAXPOOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAXPOOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAXPOOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
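The same layer list written out as a hedged PyTorch sketch (layer choices follow the table above; the ReLUs and the LRN hyperparameters are my additions, not the paper's exact values):

```python
import torch
import torch.nn as nn

# Simplified AlexNet following the layer list above (227x227x3 input).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),      # 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                      # 27x27x96
    nn.LocalResponseNorm(size=5),                               # NORM1
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),    # 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                      # 13x13x256
    nn.LocalResponseNorm(size=5),                               # NORM2
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                      # 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(),                    # FC6
    nn.Linear(4096, 4096), nn.ReLU(),                           # FC7
    nn.Linear(4096, 1000),                                      # FC8 (class scores)
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```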
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus (see the optimizer sketch after this list)
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
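A hedged PyTorch sketch of these training hyperparameters (`model` is a placeholder; the original learning-rate schedule was manual, so ReduceLROnPlateau is only a modern stand-in):

```python
import torch

# Placeholder model; in practice this would be the AlexNet sketch above.
model = torch.nn.Linear(10, 10)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # initial learning rate
    momentum=0.9,       # SGD momentum
    weight_decay=5e-4,  # L2 weight decay
)

# The original schedule divided the learning rate by 10 manually when
# validation accuracy plateaued; this scheduler automates that rule.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5
)

# Per epoch: train with mini-batches of 128, then call
# scheduler.step(val_accuracy)
```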
ZFNet
Zeiler and Fergus, 2013, ImageNet Challenge
Small tweaks on top of AlexNet
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
Reduced the top-5 error from 16.4% to 11.7%
VGGNet
Simonyan and Zisserman, 2014, ImageNet Challenge
Key ideas: small filters, deeper networks
8 layers (AlexNet) -> 16-19 layers (VGG16/VGG19)
Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2
Top-5 error dropped from 11.7% to 7.3%
Q: Why use smaller filters?
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as a single 7x7 conv layer.
The first layer sees a 3x3 region; the second layer is also 3x3, but because its inputs overlap, its effective receptive field is 5x5; after the third layer the effective receptive field reaches 7x7.
At the same time this makes the network deeper, with more non-linearities and fewer parameters.
Assuming each layer has C input and C output channels, the parameter count drops from 7^2*C^2 = 49C^2 to 3*(3^2*C^2) = 27C^2.
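A quick sanity check of that arithmetic in plain Python (the channel count C is arbitrary):

```python
C = 256  # arbitrary channel count, same for the input and output of each layer

params_7x7 = 7 * 7 * C * C               # one 7x7 conv layer, C -> C channels
params_3x3_stack = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers

print(params_7x7, params_3x3_stack)      # 3211264 vs 1769472
print(params_3x3_stack / params_7x7)     # 27/49 ~= 0.55: roughly half the parameters
```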
To get a feel for the memory and parameter footprint of a network:
Take VGG16 as an example.
The conv and FC layers account for the bulk of it: the early conv layers dominate the activation memory, while the FC layers contribute most of the parameters.
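One way to get that feel, assuming torchvision is available (a sketch, not part of the lecture): count the parameters of torchvision's VGG16 and split them into conv vs. FC layers.

```python
import torchvision

# Randomly initialized VGG16; no pretrained weights needed just to count parameters.
vgg16 = torchvision.models.vgg16()

conv_params = sum(p.numel() for p in vgg16.features.parameters())
fc_params = sum(p.numel() for p in vgg16.classifier.parameters())

print(f"conv layers: {conv_params / 1e6:.1f}M parameters")  # roughly 15M
print(f"fc   layers: {fc_params / 1e6:.1f}M parameters")    # roughly 124M, FC dominates
```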
Details:
- ILSVRC’14 2nd in classification, 1st in localization (i.e., also drawing a bounding box around the object)
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks
GoogLeNet
Szegedy, 2014, ImageNet Challenge
- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters! 12x less than AlexNet
- ILSVRC’14 classification winner (6.7% top-5 error)
“Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other
But stacking these parallel filters naively makes the computation blow up. Assume the input is 28x28x256, stride 1, with padding chosen so the spatial size stays the same:
How do we reduce the computation? Use 1x1 convolutions: they preserve the spatial dimensions while bringing the depth down!
Does this lose information?
Somewhat, but this combination of feature maps also adds extra non-linearity.
Worth noting: the auxiliary losses (extra classification heads attached to intermediate layers) are there to keep gradients from vanishing during backpropagation.
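To make the 1x1 bottleneck idea concrete, here is a minimal PyTorch sketch on a 28x28x256 input (the filter counts are illustrative, not GoogLeNet's actual configuration):

```python
import torch
import torch.nn as nn

# Naive branch: 5x5 conv applied directly to the full 256-channel input.
naive_5x5 = nn.Conv2d(256, 192, kernel_size=5, padding=2)

# Bottleneck branch: a 1x1 conv first reduces depth 256 -> 64 (spatial size
# unchanged), then the 5x5 conv operates on the thinner volume.
bottleneck_5x5 = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
)

x = torch.randn(1, 256, 28, 28)
print(naive_5x5(x).shape, bottleneck_5x5(x).shape)  # both 1x192x28x28

def weight_count(m):
    # Count conv weights only, ignoring biases.
    return sum(p.numel() for p in m.parameters() if p.dim() > 1)

# The bottleneck cuts the weights (and hence the multiplies) by roughly 4x here.
print(weight_count(naive_5x5), weight_count(bottleneck_5x5))
```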
ResNet
Kaiming He, Jian Sun, et al., 2015, ImageNet Challenge
- 152-layer model for ImageNet
- ILSVRC’15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC’15 and COCO’15!
What happens if we just keep making a plain network deeper? In theory it should do at least as well, but
the deeper network actually performs worse, and since its training error is also higher (not just its test error), the problem is not overfitting!
Hypothesis: this is an optimization problem; deeper models are simply harder to optimize.
Solution: instead of asking a stack of layers to fit the desired mapping H(x) directly, let it fit the residual F(x) = H(x) - x and output F(x) + x. (Why? If the identity mapping is already close to optimal, the layers only need to push F(x) toward zero, which is easy to learn.)
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning
- No FC layers at the end (only FC 1000 to output classes)
For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet):
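A minimal PyTorch sketch of the two block types (class names and channel choices are mine, not the reference implementation); each block computes F(x) + x, with a strided 1x1 projection on the shortcut whenever the shape changes:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs (each followed by BatchNorm) plus an identity shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # When we downsample (stride 2) or change depth, project the shortcut too.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(residual + self.shortcut(x))   # H(x) = F(x) + x

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, as used in ResNet-50 and deeper."""
    def __init__(self, in_ch, mid_ch, stride=1, expansion=4):
        super().__init__()
        out_ch = mid_ch * expansion
        self.path = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return torch.relu(self.path(x) + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 128, stride=2)(x).shape)   # 1x128x28x28 (downsampling block)
print(BottleneckBlock(64, 64)(x).shape)         # 1x256x56x56 (bottleneck expands 4x)
```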
Training ResNet in practice:
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
1. Showed that very deep networks can be trained without degrading.
2. Realized the expectation that a very deep network can reach low training error.
3. ILSVRC 2015 classification winner (3.6% top-5 error), better than reported “human performance”!
Comparison of architectures: the x-axis is the amount of computation, the y-axis is accuracy.
Some other architectures:
Network in Network (NiN)
Min Lin et al., 2014
1. Proposed the micronetwork (a small MLP applied inside each conv layer)
2. A precursor to the later “bottleneck” layers (via its 1x1 convolutions)
Both ideas inspired GoogLeNet.
Identity Mappings in Deep Residual Networks
He et al., 2016
1. Moves the activation onto the residual mapping pathway (“pre-activation” residual blocks)
Wide Residual Networks
Zagoruyko et al., 2016
1. Argues that the residual connections are the important factor, not the depth
2. Uses wider residual blocks (F x k filters instead of F filters in each layer)
3. Increasing width instead of depth is more computationally efficient (more parallelizable)
Aggregated Residual Transformationsfor Deep Neural Networks (ResNeXt)
Xie et al., 2016 (from the creators of ResNet)
1. Increases the width of the residual block through multiple parallel pathways (“cardinality”)
Deep Networks with Stochastic Depth
Huang et al. 2016
- Motivation: reduce vanishing gradients and training time through short networks during training
- Randomly drop a subset of layersduring each training pass
- Bypass with identity function
- Use full deep network at test time
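A minimal sketch of the idea (the class name and survival probability are assumptions; `block` stands for any residual branch with matching input/output shape):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training."""
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            # Randomly drop this layer: bypass with the identity function.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return x + self.block(x)
        # Test time: the full deep network is used; scaling by the survival
        # probability keeps the expected output consistent with training.
        return x + self.survival_prob * self.block(x)

layer = StochasticDepthBlock(nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1)
))
print(layer(torch.randn(1, 64, 32, 32)).shape)  # 1x64x32x32
```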
FractalNet: Ultra-Deep Neural Networks without Residuals
Larsson et al. 2017
- Argues that the key is transitioning effectively from shallow to deep, and that residual representations are not necessary
- Fractal architecture with both shallow and deep paths to the output
- Trained with dropping out sub-paths
- Full network at test time
Densely Connected Convolutional Networks
Huang et al. 2017
- Dense blocks where each layer is connected to every other layer in a feed-forward fashion
- Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
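A minimal sketch of one dense block (the growth rate and layer count are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Feed the concatenation of every earlier output into this layer.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # 1x192x32x32 (64 + 4*32 channels)
```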
SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size
Iandola et al. 2017
1. Fire modules consisting of a ‘squeeze’ layer with 1x1 filters feeding an ‘expand’ layer with 1x1 and 3x3 filters
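A minimal sketch of a Fire module (the channel counts are illustrative, not SqueezeNet's exact configuration):

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze with 1x1 filters, then expand with parallel 1x1 and 3x3 filters."""
    def __init__(self, in_ch, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand_1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand_3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)

    def forward(self, x):
        s = torch.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([torch.relu(self.expand_1x1(s)),
                          torch.relu(self.expand_3x3(s))], dim=1)

fire = FireModule(96)
print(fire(torch.randn(1, 96, 55, 55)).shape)  # 1x128x55x55
```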