【PVANet】《PVANET:Deep but Lightweight Neural Networks for Real-time Object Detection》

arXiv preprint arXiv:1608.08021, 2016.

Caffe code: https://github.com/sanghoon/pva-faster-rcnn/blob/master/models/pvanet/example_train/train.prototxt
Caffe model visualization tool (Netscope): http://ethereon.github.io/netscope/#/editor



1 Background and Motivation

Object detection accuracy is already quite good, and the automotive and surveillance domains offer a broad commercial market, but detection speed remains a concern. Starting from this speed bottleneck, the authors redesign the backbone, following the design principle of "less channels with more layers". The result achieves very competitive accuracy on VOC 07 and 12 while drastically reducing the computational cost, making real-time detection possible.

2 Advantages / Contributions

  • 83.8% mAP on VOC 2007
  • 82.5% mAP on VOC 2012 (2nd place, with only 12.3% of the computational cost of the 1st-place ResNet entry)
  • 46 ms/image on a Titan X (21.7 FPS)

All of this rests on the newly designed lightweight feature extraction network.

3 Innovations

  • A fully redesigned object detection network that is lightweight yet accurate
  • A large speed-up, reaching real-time detection

4 Method

4.1 C.ReLU: Earlier building blocks in feature generation

[Figure: PVANet's modified C.ReLU building block]
Here "C" stands for concatenation. Unlike the original C.ReLU (from "Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units"), the authors append a per-channel Scale/Shift operation, analogous to the affine recovery step in Batch Normalization. The motivation: "In the early stage, output nodes tend to be 'paired' such that one node's activation is the opposite side of another's." The number of convolution channels can therefore be halved, and concatenating each response with its negation recovers the full set, with comparable accuracy and a 2x speed-up.
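A minimal PyTorch sketch of the idea (the module name CReLU and the parameter layout are my own; the paper's implementation is in Caffe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CReLU(nn.Module):
    """Concatenated ReLU with a per-channel scale/shift, as in PVANet.
    The conv produces only half the target channels; negation doubles them."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        assert out_channels % 2 == 0
        self.conv = nn.Conv2d(in_channels, out_channels // 2, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels // 2, affine=False)
        # per-channel scale/shift applied AFTER the concatenation, so the
        # "negative" copies are not forced to mirror the positives exactly
        self.scale = nn.Parameter(torch.ones(out_channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(out_channels, 1, 1))

    def forward(self, x):
        x = self.bn(self.conv(x))
        x = torch.cat([x, -x], dim=1)  # "C" = concatenation of x and -x
        return F.relu(x * self.scale + self.shift)
```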

4.2 Inception: Remaining building blocks in feature generation

[Figure: the Inception building block used in PVANet]
Compared with the original GoogLeNet Inception module, the authors' stacked version drops the pooling branch and replaces the 5x5 branch with two stacked 3x3 convolutions. This structure is good at capturing objects of different sizes, which the authors explain with the figure below.
[Figure: how stacking Inception blocks spreads the receptive-field distribution]
Haha, this figure is a bit baffling at first glance, but no fear: we have weathered bigger storms than a 2016 paper! Analyze it carefully and it all falls into place.

The figure depicts three stacked Inception blocks. In the first layer, the branches with receptive fields 1, 3 and 5 carry fractions $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ of the channels; after stacking two layers the distribution becomes $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}) * (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$. Just keep the receptive-field composition rule in mind: $1 * x = x$, $3 * 3 = 5$, $3 * 5 = 7$, and so on; composing two odd receptive fields $a$ and $b$ (at stride 1) yields the next odd number, $a + b - 1$.

Let us compute $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}) * (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$, i.e. the second-layer distribution, obtained by composing receptive fields $(1,3,5) * (1,3,5)$:

  • Receptive field 1: only $1*1$, i.e. $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$
  • Receptive field 3: $1*3$ and $3*1$, i.e. $\frac{1}{2} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{4}$
  • Receptive field 5: $1*5$, $5*1$ and $3*3$, i.e. $\frac{1}{2} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{4} = \frac{5}{16}$
  • Receptive field 7: $3*5$ and $5*3$, i.e. $\frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{8}$
  • Receptive field 9: only $5*5$, i.e. $\frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}$

The third layer is then $(\frac{1}{4}, \frac{1}{4}, \frac{5}{16}, \frac{1}{8}, \frac{1}{16}) * (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$, i.e. receptive fields $(1,3,5,7,9) * (1,3,5)$: apply the same receptive-field "multiplication rule" and multiply the channel fractions of the corresponding branches.
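This is just a convolution of channel-fraction distributions over receptive-field sizes. A short Python sketch (function and variable names are mine) reproduces the numbers above:

```python
from collections import defaultdict

def compose(dist_a, dist_b):
    """Compose two receptive-field distributions {rf_size: channel_fraction}.
    Stacking convs with receptive fields a and b yields RF a + b - 1."""
    out = defaultdict(float)
    for rf_a, frac_a in dist_a.items():
        for rf_b, frac_b in dist_b.items():
            out[rf_a + rf_b - 1] += frac_a * frac_b
    return dict(out)

layer = {1: 1/2, 3: 1/4, 5: 1/4}  # one Inception block
two = compose(layer, layer)       # {1: 1/4, 3: 1/4, 5: 5/16, 7: 1/8, 9: 1/16}
three = compose(two, layer)       # distribution after three stacked blocks
print(two)
print(three)
```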

As the paper puts it, "it slows down the growth of receptive fields for some output features so that small-sized objects can be captured precisely."

4.3 HyperNet: Concatenation of multi-scale intermediate outputs

[Figure: the overall PVANet structure table] In the table, "x×x C.ReLU" denotes a 1×1 → x×x → 1×1 pattern, where the x×x C.ReLU part is shown in the figure below:
[Figure: the x×x C.ReLU block used inside the 1×1 → x×x → 1×1 pattern]

  • In the Inception tables, "# out" refers to the 1×1 convolution applied after the concatenation
  • In the residual structure, a 1×1 projection shortcut is used whenever stride = 2 or the number of channels changes
  • Multi-scale features are built as follows: conv3_4 is downscaled (128 channels), conv4_4 is kept as-is (256 channels), and conv5_4 is upscaled (384 channels); the three are concatenated into convf (see the sketch after the figure below)

[Figure: overall PVANet detection pipeline with multi-scale feature concatenation]
Image from the blog post "[目标检测]PVAnet原理", which is simple and clear.
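A minimal PyTorch sketch of the multi-scale concatenation, assuming the channel counts from the bullet above (128/256/384), pooling for the downscale and a 4×4 channel-wise deconvolution for the upscale; the module name and exact pooling parameters are my own:

```python
import torch
import torch.nn as nn

class MultiScaleConcat(nn.Module):
    """Fuse conv3_4 (downscaled), conv4_4 and conv5_4 (upscaled) into convf.
    All three are brought to conv4_4's spatial resolution (assumed even)."""
    def __init__(self):
        super().__init__()
        self.down = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # channel-wise (grouped) deconvolution, kernel 4, stride 2: 2x upsampling
        self.up = nn.ConvTranspose2d(384, 384, kernel_size=4, stride=2,
                                     padding=1, groups=384, bias=False)

    def forward(self, conv3_4, conv4_4, conv5_4):
        # conv3_4: 128 ch, conv4_4: 256 ch, conv5_4: 384 ch -> convf: 768 ch
        return torch.cat([self.down(conv3_4), conv4_4, self.up(conv5_4)], dim=1)
```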

  • The RPN takes the first 128 channels of convf, followed by a 3×3 conv (384 channels) and a 1×1 conv (25 × (2 + 4) = 150 channels). The 25 anchors come from 5 scales (3, 6, 9, 16, 25) and 5 aspect ratios (0.5, 0.667, 1.0, 1.5, 2.0); the 2 is the object/background classification, the 4 the bbox deltas (a quick sanity check follows this list)
  • The detection head, after RoI pooling (6×6×512): fc 4096 → fc 4096 → 21 class scores (20 classes + background) and 84 bbox deltas (21 × 4)
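A quick sanity check of the 150-channel figure, using the scales and ratios listed above:

```python
# 5 scales x 5 aspect ratios = 25 anchors per spatial position
scales = [3, 6, 9, 16, 25]
ratios = [0.5, 0.667, 1.0, 1.5, 2.0]
num_anchors = len(scales) * len(ratios)   # 25
rpn_out_channels = num_anchors * (2 + 4)  # 2 scores + 4 deltas per anchor
print(num_anchors, rpn_out_channels)      # -> 25 150
```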

4.4 Deep network training

  • Batch Normalization
  • Learning-rate control based on a moving average of the loss (Keras provides something similar; a minimal sketch follows this list)
  • Inception + residual connections (note: after the concatenation in an Inception block the authors apply a 1×1 conv; the residual connection is either the identity x or a 1×1 conv projection, and it is added to the output of that 1×1 conv)
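A minimal sketch of the plateau-based learning-rate policy (the smoothing factor, improvement threshold, decay factor and patience below are placeholders of mine, not the paper's values):

```python
class PlateauLRPolicy:
    """Decay the learning rate when a moving average of the loss stops improving."""
    def __init__(self, lr, momentum=0.9, min_improve=1e-3, decay=0.1, patience=100):
        self.lr = lr
        self.momentum = momentum        # smoothing factor for the moving average
        self.min_improve = min_improve  # required relative improvement
        self.decay = decay              # multiplicative LR decay factor
        self.patience = patience        # iterations to wait before decaying
        self.avg = None
        self.best = float('inf')
        self.bad_steps = 0

    def step(self, loss):
        # exponential moving average of the raw training loss
        self.avg = loss if self.avg is None else \
            self.momentum * self.avg + (1 - self.momentum) * loss
        if self.avg < self.best * (1 - self.min_improve):
            self.best = self.avg        # still improving
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr *= self.decay   # plateau detected: decay the LR
                self.bad_steps = 0
        return self.lr
```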

5 Experiments

5.1 Datasets and Training

ILSVRC2012、MS COCO、PASCAL VOC 2007、2012

  • Pre-training: ILSVRC2012
  • Then: training on MS COCO plus the PASCAL VOC 2007 and 2012 trainval sets
  • Fine-tuning: PASCAL VOC 2007 and 2012 trainval

5.2 VOC 2007

[Table: detection results on VOC 2007]

5.3 VOC 2012

[Table: detection results on VOC 2012]
The MAC count (number of multiply-add operations) is strikingly low, while the mAP is on par with the state of the art (2nd place); the authors also point out that the 1st-place entry relied on extra tricks such as multi-scale testing!

6 Conclusion (Own)

C.ReLU is genuinely inspiring. Interestingly, upsampling is done with a 4×4 deconvolution; then again, the upsampling factor seems to depend on the stride rather than the kernel size, which is worth digging into later (see the check below). The overall approach to network design is instructive!
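A quick PyTorch check of the kernel-size-vs-stride question (a toy example of mine): the output resolution of a deconvolution is set by the stride, while the kernel size only changes how the interpolated values overlap.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)
# padding chosen so all three kernel sizes produce the same output size
for k, p in [(2, 0), (4, 1), (6, 2)]:
    up = nn.ConvTranspose2d(8, 8, kernel_size=k, stride=2, padding=p)
    print(k, tuple(up(x).shape))  # every line prints (1, 8, 32, 32)
```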