3. Convolutional Neural Network Architectures and Their Evolution -- Deep Learning EECS498/CS231n
AlexNet
Output size:
The number of output channels equals the number of filters: both are 64.
H/W = (H - K + 2P)/S + 1 = (227 - 11 + 2*2)/4 + 1 = 56 (AlexNet conv1: K=11, S=4, P=2)
Memory(KB):
Number of output elements: C * H * W = 64 * 56 * 56 = 200704; bytes per element = 4 (32-bit floating point). KB = 200704 * 4 / 1024 = 784
Parameters(k):
Weight shape = Cout * Cin * K * K = 64 * 3 * 11 * 11
Bias shape = 64
Number of parameters = 64 * 3 * 11 * 11 + 64 = 23296
FLOPs (M) -- important!
Number of floating point operations, counting one multiply + add as a single op (since they can be done in one cycle)
= (number of output elements) x (ops per element)
= (Cout x H x W) x (Cin x K x K) = (64 x 56 x 56) x (3 x 11 x 11) = 72855552 (~72.9 MFLOP)
(The ReLU immediately following conv1 is omitted here.)
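To sanity-check the conv1 numbers above, here is a minimal pure-Python sketch (the helper name conv_stats is my own; the hyperparameters 227x227x3 input, 64 filters of size 11x11, stride 4, pad 2 are AlexNet's conv1):

```python
def conv_stats(h_in, w_in, c_in, c_out, k, stride, pad, bytes_per_elem=4):
    # Output spatial size: (H - K + 2P) / S + 1
    h_out = (h_in - k + 2 * pad) // stride + 1
    w_out = (w_in - k + 2 * pad) // stride + 1
    # Memory of the output activation in KB (4 bytes per fp32 element)
    mem_kb = c_out * h_out * w_out * bytes_per_elem / 1024
    # Parameters: weights (Cout * Cin * K * K) plus one bias per filter
    params = c_out * c_in * k * k + c_out
    # FLOPs: one multiply-add per (output element, filter tap) pair
    flops = (c_out * h_out * w_out) * (c_in * k * k)
    return h_out, w_out, mem_kb, params, flops

# AlexNet conv1: 227x227x3 input, 64 filters of size 11x11, stride 4, pad 2
print(conv_stats(227, 227, 3, 64, 11, 4, 2))
# (56, 56, 784.0, 23296, 72855552)
```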
For pooling:
- It does not change the number of channels.
- FLOPs(M) = (number of output positions) x (FLOPs per output position) = (Cout x H x W) x (K x K) ≈ 0.4 MFLOP. Note that compared to conv, the computational cost of pooling is negligible (see the sketch below).
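A minimal sketch of that pooling estimate (assuming pool1 is a 3x3 max pool with stride 2 applied to the 64x56x56 output of conv1):

```python
# AlexNet pool1: 3x3 max pool, stride 2, on the 64x56x56 output of conv1
c, h_in, w_in, k, stride = 64, 56, 56, 3, 2
h_out = (h_in - k) // stride + 1             # 27
w_out = (w_in - k) // stride + 1             # 27
pool_flops = (c * h_out * w_out) * (k * k)   # roughly one op per window element
print(pool_flops)   # 419904, i.e. ~0.4 MFLOP vs ~72.9 MFLOP for conv1
```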
How was AlexNet designed? Trial and error.
VGG
Design rules for VGG:
- All conv are 3x3 stride 1 pad 1
- All max pool are 2x2 stride 2
- After pool, double #channels
- The network is built from convolutional stages; VGG-16 has 5 stages (see the sketch after this list):
- Stage 1: conv-conv-pool
- Stage 2: conv-conv-pool
- Stage 3: conv-conv-conv-[conv]-pool
- Stage 4: conv-conv-conv-[conv]-pool
- Stage 5: conv-conv-conv-[conv]-pool ([conv] marks the extra conv that VGG-19 adds in stages 3-5)
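A minimal sketch of these stages, assuming PyTorch (the layer counts 2-2-3-3-3 and channel widths 64/128/256/512/512 follow the VGG-16 configuration; the channel doubling stops at 512 in stage 5):

```python
import torch
import torch.nn as nn

def vgg_stage(c_in, c_out, num_convs):
    # A VGG stage: num_convs 3x3/stride-1/pad-1 convs (+ReLU), then a 2x2/stride-2 pool
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

vgg16_features = nn.Sequential(
    vgg_stage(3,   64,  2),   # Stage 1
    vgg_stage(64,  128, 2),   # Stage 2
    vgg_stage(128, 256, 3),   # Stage 3
    vgg_stage(256, 512, 3),   # Stage 4
    vgg_stage(512, 512, 3),   # Stage 5
)

x = torch.randn(1, 3, 224, 224)
print(vgg16_features(x).shape)   # torch.Size([1, 512, 7, 7])
```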
Why 3x3 conv?
Option 1: conv(5x5, C -> C)
Params: 25C^2; FLOPs: 25C^2HW
Option 2: conv(3x3, C -> C) -> conv(3x3, C -> C)
Params: 18C^2; FLOPs: 18C^2HW. Same receptive field, fewer parameters, and less computation. Moreover, with two convs we can insert a ReLU between them, which adds depth and nonlinear computation.
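A quick check of this comparison in plain Python (C, H, W below are example values I picked, not from the slides; biases are ignored):

```python
C, H, W = 256, 56, 56   # example values

# Option 1: one 5x5 conv, C -> C
params_5x5 = 25 * C * C
flops_5x5  = 25 * C * C * H * W

# Option 2: two stacked 3x3 convs, C -> C -> C (same 5x5 receptive field)
params_3x3 = 2 * 9 * C * C
flops_3x3  = 2 * 9 * C * C * H * W

print(params_5x5, params_3x3)   # 1638400 1179648
print(flops_5x5,  flops_3x3)    # 5138022400 3699376128
```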
Why double the channels after pooling?
After a 2x2 pool the spatial size is halved, so a conv with an unchanged channel count would take 4x fewer FLOPs. Doubling the channel count on the half-sized input keeps the FLOPs the same, so the conv layers at each spatial resolution take the same amount of computation (see the sketch below).
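A minimal sketch of that arithmetic (C, H, W are example values I picked):

```python
C, H, W = 64, 224, 224   # example values

# Conv at stage i: C channels at full resolution
flops_i     = (C * H * W) * (C * 3 * 3)
params_i    = 9 * C * C
# Conv at stage i+1: spatial size halved by the pool, channels doubled
flops_next  = (2 * C * (H // 2) * (W // 2)) * (2 * C * 3 * 3)
params_next = 9 * (2 * C) * (2 * C)

print(flops_i == flops_next)    # True: same FLOPs at every spatial resolution
print(params_next // params_i)  # 4: parameters grow, but compute stays balanced
```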
GoogLeNet: Inception Module
A local unit with parallel branches that is repeated many times throughout the network.
Use 1x1 bottleneck layers to reduce the channel dimension before the expensive conv.
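A minimal sketch of why the 1x1 bottleneck saves computation (the 256 -> 64 channel reduction and the 28x28 spatial size are example values, not GoogLeNet's exact ones):

```python
C_in, C_mid, H, W = 256, 64, 28, 28   # example sizes

# Direct 3x3 conv, 256 -> 256 channels
flops_direct = (C_in * H * W) * (C_in * 3 * 3)

# Bottleneck: 1x1 conv (256 -> 64), then 3x3 conv (64 -> 256)
flops_bottleneck = ((C_mid * H * W) * (C_in * 1 * 1)     # cheap 1x1 reduce
                    + (C_in * H * W) * (C_mid * 3 * 3))  # 3x3 sees fewer input channels

print(flops_direct, flops_bottleneck)   # 462422016 128450560, ~3.6x fewer FLOPs
```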
ResNet
What happens when we go deeper?
This is an optimization problem: deeper models are harder to optimize; in particular, they do not learn the identity functions that would let them emulate shallower models.
-> Change the network so that learning identity functions with the extra layers is easy.
A residual block can easily learn the identity function: if the weights of its two conv layers are set to zero, the block computes the identity, which makes it easy for a deep network to emulate a shallower one. It also improves the gradient flow of deep networks, because the add gate makes a copy of the gradient and passes it back through the shortcut.
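A minimal sketch of a basic residual block, assuming PyTorch (the class name and the conv-BN-ReLU ordering are my own illustrative choices; ResNet has several variants):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # Same-channel, stride-1 residual block: out = ReLU(F(x) + x)
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # If the conv weights (and BN shift) are zero, out is zero and the block
        # is the identity; the addition also passes the gradient straight
        # through the shortcut.
        return self.relu(out + x)

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```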
Learn from VGG: stages, 3x3 conv
Learn from GoogLeNet: an aggressive stem to downsample the input before applying residual blocks, and global average pooling to avoid expensive FC layers.
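A minimal sketch, assuming PyTorch, of these two borrowed ideas (the 7x7/stride-2 conv + 3x3/stride-2 max pool stem and the 512-d global-average-pool head follow the common ResNet layout, but treat the exact numbers as illustrative):

```python
import torch
import torch.nn as nn

# Aggressive stem: downsample 224 -> 56 before any residual block
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 224 -> 112
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 112 -> 56
)

# Head: global average pool + one small FC instead of VGG-style 4096-d FC layers
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # C x 7 x 7 -> C x 1 x 1
    nn.Flatten(),
    nn.Linear(512, 1000),      # 512 = channel count of the last residual stage
)

print(stem(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 56, 56])
print(head(torch.randn(1, 512, 7, 7)).shape)     # torch.Size([1, 1000])
```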