Rich feature hierarchies for accurate object detection and semantic segmentation

RCNN object detection

Model structure

The overall model consists of three sub-models:
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third is a set of class-specific linear SVMs.

Region proposals

Region proposals are generated with selective search.

Feature extraction

Extract a 4096-dimensional feature vector from each region proposal using Caffe.
To bring each region to 227 × 227 pixels, we warp all pixels in a tight bounding box around it to the required size.
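
The warping step can be sketched as follows. This is a minimal illustration in plain Python, assuming a grayscale image stored as a list of rows, inclusive pixel coordinates for the box, and nearest-neighbour sampling. `warp_region` and its parameters are hypothetical names, not taken from the paper or from Caffe.

```python
def warp_region(image, box, out_size=227):
    """Rescale the pixels inside `box` to out_size x out_size.

    image: 2-D list of rows (grayscale, for brevity).
    box:   (x1, y1, x2, y2), inclusive pixel coordinates.
    The aspect ratio is NOT preserved: width and height are scaled
    independently, which is what "warping" means here.
    """
    x1, y1, x2, y2 = box
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    warped = []
    for r in range(out_size):
        # Nearest-neighbour: map output row r back to a source row.
        src_row = image[y1 + r * h // out_size]
        warped.append([src_row[x1 + c * w // out_size] for c in range(out_size)])
    return warped
```

A real pipeline would use an image library with proper interpolation; the point here is only that every proposal, whatever its shape, ends up at the same fixed input size.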

Test-time detection

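At test time, selective search extracts around 2000 region proposals from the image. Each proposal is warped and fed through the CNN to extract features. Then, for each class, an SVM scores every extracted feature vector. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.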
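
The greedy non-maximum suppression described above can be sketched in plain Python. The box format `(x1, y1, x2, y2)`, the function names, and the threshold value used in the usage note are assumptions for illustration; the paper learns the threshold rather than fixing it.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first.

    A box is rejected if its IoU with any already-kept (higher-scoring)
    box is larger than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because suppression is applied for each class independently, a caller would run `greedy_nms` once per class, passing that class's SVM scores for all regions in the image, for example `greedy_nms(boxes, scores_for_class, 0.3)` with a hypothetical threshold of 0.3.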

Training

Supervised pre-training

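The CNN is first pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only.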

Domain-specific fine-tuning

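To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), we continue training it with SGD. The output has N+1 classes, where the extra 1 is the background class.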

We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives and the rest as negatives.
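
The labeling rule for fine-tuning can be sketched as below. The sketch assumes boxes in `(x1, y1, x2, y2)` form, uses label 0 for background, and gives each proposal the class of the ground-truth box it overlaps most. All names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, gt_classes, pos_iou=0.5):
    """Assign each proposal a class label in 1..N, or 0 for background."""
    labels = []
    for p in proposals:
        best_iou, best_class = 0.0, 0
        for g, cls in zip(gt_boxes, gt_classes):
            overlap = iou(p, g)
            if overlap > best_iou:
                best_iou, best_class = overlap, cls
        # Positive only if the best overlap reaches the 0.5 threshold.
        labels.append(best_class if best_iou >= pos_iou else 0)
    return labels
```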

In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.

We bias the sampling towards positive windows because they are extremely rare compared to background.
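
A mini-batch with this 32/96 split could be assembled roughly as follows. The function name, the `(window_id, is_positive)` tuple format, and the fallback to sampling with replacement when a pool is too small are assumptions of this sketch, not details from the paper.

```python
import random


def sample_minibatch(positive_ids, background_ids, rng, n_pos=32, n_bg=96):
    """Draw n_pos positive and n_bg background window ids uniformly."""

    def draw(pool, k):
        if len(pool) >= k:
            return rng.sample(pool, k)   # without replacement
        return rng.choices(pool, k=k)    # assumption: reuse windows if too few

    batch = [(i, True) for i in draw(positive_ids, n_pos)]
    batch += [(i, False) for i in draw(background_ids, n_bg)]
    rng.shuffle(batch)
    return batch
```

With these defaults a quarter of every mini-batch is positive, which is far above the natural frequency of positive windows among the proposals.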

Object category classifiers

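When a region tightly encloses an object, it is clearly a positive example; when it contains no part of the object, it is clearly a negative. But how should a region that only partially overlaps an object be labeled?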
We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, …, 0.5} on a validation set.
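
The negative-labeling rule for one class can be sketched as below. The notes only state that regions under the 0.3 threshold are negatives. Treating the remaining proposals as ignored, and taking the ground-truth boxes themselves as the positives, follows the paper as best recalled and should be read as an assumption here. All names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def svm_proposal_labels(proposals, class_gt_boxes, neg_iou=0.3):
    """Label proposals for one class: -1 = negative, 0 = ignored.

    Positives (+1) are assumed to be the ground-truth boxes themselves
    and are added to the training set separately.
    """
    labels = []
    for p in proposals:
        best = max((iou(p, g) for g in class_gt_boxes), default=0.0)
        labels.append(-1 if best < neg_iou else 0)
    return labels
```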

Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.
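
The idea of hard negative mining can be sketched generically. This is a simplified illustration, not the paper's implementation: `fit` and `score` are hypothetical callables standing in for SVM training and scoring, the negatives arrive in chunks that each fit in memory, and a negative counts as "hard" when its score exceeds the margin of -1.

```python
def hard_negative_mining(positives, negative_chunks, fit, score, margin=-1.0):
    """Train on a growing cache of hard negatives.

    fit(positives, negatives) -> model
    score(model, example)     -> float (higher means more object-like)
    """
    cache = list(negative_chunks[0])          # start from the first chunk
    model = fit(positives, cache)
    for chunk in negative_chunks[1:]:
        # Keep only negatives the current model gets wrong or nearly wrong.
        hard = [x for x in chunk if score(model, x) > margin]
        if hard:
            cache.extend(hard)
            model = fit(positives, cache)     # retrain with the new hard cases
    return model, cache


# Toy 1-D usage: the "model" is a threshold halfway between the classes.
def toy_fit(pos, neg):
    return (min(pos) + max(neg)) / 2.0


def toy_score(threshold, x):
    return x - threshold


model, cache = hard_negative_mining(
    [10.0, 12.0], [[0.0, 1.0], [2.0, 8.0], [3.0]], toy_fit, toy_score
)
```

In the toy run only the negative at 8.0 is added to the cache, since the easy negatives at 2.0 and 3.0 already score well below the margin.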

Visualization, ablation and modes of error

Visualizing learned features

We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

We compute the unit’s activations on a large set of held-out region proposals, perform non-maximum suppression, and then display the top-scoring regions.

Bounding-box regression

After scoring each proposal with the SVM, we predict a new bounding box for the detection using a bounding-box regressor.
We regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.
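
The notes do not spell out how the box is parameterized for regression. The sketch below assumes the center and log-scale form described in the paper's appendix: offsets of the box center are normalized by the proposal's width and height, and size changes are expressed as log ratios. It shows only the target encoding and its inverse, not the regressor itself; all function names are hypothetical.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def regression_targets(proposal, ground_truth):
    """Targets (tx, ty, tw, th) that map `proposal` onto `ground_truth`."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_deltas(proposal, deltas):
    """Inverse transform: move and rescale `proposal` by predicted deltas."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    gx, gy = px + pw * dx, py + ph * dy
    gw, gh = pw * math.exp(dw), ph * math.exp(dh)
    return (gx - 0.5 * gw, gy - 0.5 * gh, gx + 0.5 * gw, gy + 0.5 * gh)
```

At training time the regressor would learn to predict `regression_targets(P, G)` from the CNN features of proposal `P`; at test time `apply_deltas` turns its prediction into the refined box.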

RCNN paper notes
