【FPN】《Feature Pyramid Networks for Object Detection》

CVPR 2017

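1 Motivation

The R-CNN family runs detection on a single-scale feature map (b). Convolutional features are already somewhat robust to scale, but not robust enough: the wide range of object scales remains a hard problem, especially for small objects, and many papers target it. The simplest approach resembles data augmentation: rescale each image to several sizes and feed them all through the network (a). Enlarged images consume a lot of memory during training, so in practice the rescaling is usually applied only at test time, which means every image is run through the network several times and inference becomes slow. Another approach is SSD's (c), which detects on feature maps of different scales. In principle it could reuse the multi-scale feature maps the network has already computed, but it discards the earlier low-level feature maps: it starts from conv4_3 and appends extra conv layers to produce additional high-level feature maps to detect on.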

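This paper therefore aims both to reuse the multi-scale features that the ConvNet has already computed and to give the high-resolution, low-level features strong semantics. The natural idea is to fuse the low-resolution, high-level feature maps into them. A similar work is RON: Reverse Connection with Objectness Prior Networks for Object Detection.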

Convolutional networks generally contain both kinds of features: the first few layers learn low-level features and the last few layers learn high-level features. The authors combine low-resolution, semantically strong features with high-resolution, semantically weak features.

2 Notion

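Low-level feature: fine details in the image, such as edges, corners, color, pixels, and gradients; these can be extracted with filters, SIFT, or HOG.
High-level feature: built on top of low-level features; used to recognize and detect objects or object shapes in the image, and carries richer semantic information.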
Image pyramid


3 Advantages

In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points.
In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids.

4 Feature Pyramid Networks

4.1 Bottom-up pathway

The authors use ResNet.

We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image.
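To make the strides concrete, here is a small sketch that computes the spatial size of each bottom-up output for a given input size. The function name `pyramid_shapes` and the 800x1216 input are illustrative assumptions, and the ceil division is a simplification (the exact size in a real network depends on its padding).

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return {level name: (H, W)} for C2..C5.

    Assumption: each level's size is ceil(input / stride); real networks
    may differ slightly depending on padding.
    """
    shapes = {}
    for level, stride in enumerate(strides, start=2):
        shapes[f"C{level}"] = (-(-height // stride), -(-width // stride))
    return shapes


shapes = pyramid_shapes(800, 1216)
for name, hw in shapes.items():
    print(name, hw)
```

Each level halves the resolution of the previous one, which is why a factor-2 upsampling is enough to align neighbouring levels.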

4.2 Top-down pathway and lateral connections

we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition.

Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
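The merge step described above can be sketched in plain Python on nested lists. This is a toy illustration rather than the authors' implementation: the helper names, the tiny feature maps, and the lateral weights are all made up, and the 3×3 convolution that the paper applies after the addition is left out.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling of a [C][H][W] nested-list feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out


def conv1x1(fmap, weights):
    """1x1 convolution (no bias): weights[o][i] mixes input channel i into output o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w_o[i] * fmap[i][y][x] for i in range(len(fmap)))
          for x in range(width)]
         for y in range(height)]
        for w_o in weights
    ]


def merge(top_down, bottom_up, lateral_weights):
    """One merge step: upsample the coarser map, project the finer map, add."""
    up = upsample_nearest_2x(top_down)
    lat = conv1x1(bottom_up, lateral_weights)
    return [
        [[u + l for u, l in zip(up_row, lat_row)]
         for up_row, lat_row in zip(up_ch, lat_ch)]
        for up_ch, lat_ch in zip(up, lat)
    ]


# Toy data: the coarse map has 1 channel at 1x2, the fine map 2 channels at 2x4.
coarse = [[[1.0, 2.0]]]
fine = [
    [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]],
]
lateral = [[0.5, 0.25]]  # hypothetical weights projecting 2 channels down to 1
merged = merge(coarse, fine, lateral)
print(merged)
```

The 1×1 projection is what lets the two maps be added: it brings the bottom-up map to the same channel count as the top-down map.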

4.3 Steps for building a Faster R-CNN detector with FPN

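First, take an input image and preprocess it;
Next, feed the preprocessed image into a pretrained feature network (e.g. ResNet) to build the bottom-up network;
Next, as shown in Figure 5, build the corresponding top-down network (upsample layer 4, reduce the channel dimension of layer 2 with a 1x1 convolution, add the two element-wise, and finally apply a 3x3 convolution);
Next, run the RPN on layers 4, 5, and 6 in the figure: a 3x3 convolution followed by two branches, each a 1x1 convolution, for classification and regression respectively;
Next, feed the candidate RoIs from the previous step into layers 4, 5, and 6 and apply RoI pooling on each (producing fixed 7x7 features);
Finally, attach two 1024-d fully connected layers, followed by two branches for the classification layer and the regression layer.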
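As a rough sense of scale for the last step, the sketch below counts the parameters of the detection head. The 7x7 pooled size and the two 1024-d layers come from the steps above; the 256-channel pyramid width (the value used in the FPN paper), the 81 classes (80 COCO categories plus background), and the class-specific box regression are assumptions made for illustration.

```python
def fc_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


pool_size = 7      # RoI pooling output is 7x7
channels = 256     # assumption: pyramid channel width d = 256
hidden = 1024      # two hidden fc layers of width 1024
num_classes = 81   # assumption: 80 COCO categories plus background

flat = pool_size * pool_size * channels  # flattened RoI feature length
head = {
    "fc1": fc_params(flat, hidden),
    "fc2": fc_params(hidden, hidden),
    "cls": fc_params(hidden, num_classes),
    "reg": fc_params(hidden, 4 * num_classes),  # assumption: class-specific boxes
}
for name, count in head.items():
    print(f"{name}: {count:,}")
```

Under these assumptions the first fc layer dominates the head, since it consumes the whole flattened 7x7 RoI feature.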

5 Applications

5.1 Feature Pyramid Networks for RPN

RPN is a sliding-window class-agnostic object detector.

Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level.

Before: a single level, with anchors of multiple scales.
Now: multiple levels, with anchors of a single scale per level.
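The sketch below lists the anchors this scheme produces. The per-level areas (32² to 512² on P2 to P6) and the three aspect ratios are the values reported in the FPN paper, not in these notes; the helper `level_anchors` and the ratio convention (height / width) are illustrative assumptions.

```python
import math


def level_anchors(scale, ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs with area scale**2 and height / width = ratio."""
    anchors = []
    for ratio in ratios:
        width = scale / math.sqrt(ratio)
        height = scale * math.sqrt(ratio)
        anchors.append((width, height))
    return anchors


# One anchor scale per pyramid level (values as reported in the FPN paper).
scales = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
anchors = {level: level_anchors(scale) for level, scale in scales.items()}
total = sum(len(a) for a in anchors.values())
print(total, "anchor shapes over the whole pyramid")
```

Scale variation is handled by the choice of level, so each level only needs to vary the aspect ratio.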

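After the RPN generates RoIs, which pyramid level should each RoI's features be taken from? The paper assigns an RoI of width $w$ and height $h$ to level $P_{k}$ with
$k = \lfloor k_{0} + \log_{2}(\sqrt{wh} / 224) \rfloor$
$k_{0}$ is the level a single-scale Faster R-CNN would take its feature map from. The ResNet paper takes it from C4, so $k_{0} = 4$ (C5 then plays the role of the fc layers; some implementations instead take features from C5 and append extra fc layers). For example, for an RoI of size $w / 2 \times h / 2$ with $w = h = 224$, we get $k = k_{0} - 1 = 4 - 1 = 3$.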
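The assignment rule from the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)), can be sketched as follows. The function name `roi_level` is hypothetical, and clamping the result to the range P2 to P5 is an assumption borrowed from common implementations rather than something stated in these notes.

```python
import math


def roi_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Pyramid level for an RoI of the given size (in input-image pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    return max(k_min, min(k_max, k))  # assumption: clamp to the available levels


for size in (32, 112, 224, 448, 1000):
    print(size, "->", f"P{roi_level(size, size)}")
```

A 224x224 RoI lands on P4, and every halving of its side length moves it one level finer, so small RoIs read from high-resolution maps.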

5.2 Feature Pyramid Networks for Fast RCNN

Fast R-CNN is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.

6 Experiments on Object Detection

6.1 Region Proposal with RPN

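This section checks how effective the RPN is with FPN added; see Table 1. All results are based on ResNet-50. The metric is AR (Average Recall): the superscript 100 means 100 proposals per image, and the subscripts s, m, l refer to small, medium, and large objects in COCO. Braces {} in the feature column mean that predictions are made independently on each level.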

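Comparing (a), (b), and (c) shows that FPN clearly helps. Comparing (a) with (b) also shows that a higher-level feature map is not necessarily more effective than the one a level below it.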
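For readers unfamiliar with the metric, here is a simplified, single-image sketch of Average Recall: recall is averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05, using at most 100 proposals. This is not the official COCO evaluation code (which also aggregates over images and categories); the function names and the toy boxes are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_recall(gt_boxes, proposals, max_proposals=100):
    """Mean recall over IoU thresholds 0.50, 0.55, ..., 0.95 for one image."""
    kept = proposals[:max_proposals]  # proposals assumed sorted by score
    best = [max((iou(g, p) for p in kept), default=0.0) for g in gt_boxes]
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    recalls = [sum(b >= t for b in best) / len(gt_boxes) for t in thresholds]
    return sum(recalls) / len(recalls)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 10), (21, 20, 30, 28)]  # IoUs with their targets: 1.0 and 0.72
ar = average_recall(gt, props)
print(ar)
```

In this toy case the second box is recalled only at thresholds up to 0.70, so it contributes to half of the ten thresholds.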

6.1.1 How important is top-down enrichment?

Table 1(d)

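This variant keeps only the lateral connections and drops the top-down pathway: each bottom-up level simply gets a 1×1 lateral connection followed by a 3×3 convolution to produce the final output, somewhat like Fig. 1(b). The feature column shows that predictions are still made independently per level. The authors conjecture that (d) performs poorly because the semantic gaps between different levels of the bottom-up pyramid are large.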

6.1.2 How important are lateral connections?

Table 1(e)
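This variant (a top-down pathway without lateral connections) also performs poorly, because object locations become less precise after repeated downsampling and upsampling.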

6.1.3 How important are pyramid representations?

Table 1(f)

6.2 Object Detection with Fast/Faster RCNN

fast rcnn

faster rcnn

6.3 Comparing with COCO Competition Winners

7 Extensions: Segmentation Proposals

Other applications:
Our method is a generic pyramid representation and can be used in applications other than object detection (to generate segmentation proposals).

8 On-site Q&A at CVPR

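Why can feature maps from different depths be added directly after upsampling?

A: The authors explained that it works because of end-to-end training: the parameters of the different layers are not fixed, and all levels receive supervision simultaneously, so the sum learns to fuse shallow and deep information effectively.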

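Why does FPN improve small-object detection so much compared with the variant that removes the upsampled deep features (the bottom-up pyramid)? (AR at the RPN stage goes from 30.5 to 44.9; AP at the Fast R-CNN stage goes from 24.9 to 33.9.)

A: The authors answered this question on their poster.

For small objects we need, on the one hand, high-resolution feature maps that attend to small regions; on the other hand, as with the shoulder bag in the figure, we need more global context to judge the bag's presence and location accurately.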

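Setting runtime aside, could an image pyramid outperform a feature pyramid?

A: The authors think it is possible with careful tuning and training, but the main problem with image pyramids is their time and memory cost, whereas a feature pyramid handles multi-scale detection with almost no extra computation.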

