R-CNN paper notes
Rich feature hierarchies for accurate object detection and semantic segmentation
R-CNN object detection
Model structure
The overall model consists of three sub-models:
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
* The second is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third is a set of class-specific linear SVMs.
Region proposals
Region proposals are generated with selective search.
Feature extraction
Extract a 4096-dimensional feature vector from each region proposal using Caffe.
To make each region 227 × 227 pixels, we warp all pixels in a tight bounding box around it to the required size.
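The warping step can be sketched as below. This is only a minimal illustration: the function name `warp_region`, the `(x1, y1, x2, y2)` box format (upper bounds exclusive), and the nearest-neighbour sampling are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def warp_region(image, box, size=227):
    """Crop box = (x1, y1, x2, y2) from image (H x W x C) and warp it,
    ignoring aspect ratio, to size x size with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map every output row/column back to a source row/column in the crop.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols[None, :]]

# Hypothetical input: a random 300 x 400 RGB image and one proposal box.
image = np.random.randint(0, 256, size=(300, 400, 3), dtype=np.uint8)
warped = warp_region(image, (50, 40, 250, 140))
print(warped.shape)  # (227, 227, 3)
```

The 200 × 100 crop is stretched to a square, so the aspect ratio is deliberately not preserved.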
Test-time detection
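At test time, selective search extracts about 2000 region proposals from the image. Each region is warped and fed to the CNN to extract features; then, for each class, the SVM trained for that class scores each feature vector. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union overlap with a higher scoring selected region larger than a learned threshold.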
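A minimal sketch of greedy non-maximum suppression for a single class follows. The helper names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box format, and the threshold value 0.3 used in the example are assumptions for illustration; the notes only say that the threshold is learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, threshold):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject box i if it overlaps an already selected,
        # higher-scoring box by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, threshold=0.3)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU of about 0.82, so it is suppressed; the third box does not overlap anything and is kept. In the full pipeline this would be run once per class.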
Training
Supervised pre-training
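The CNN is first pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only.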
Domain-specific fine-tuning
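To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), we continue training it with SGD. The classification layer has N+1 classes, where N is the number of object classes and the extra 1 is the background.
We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives and the rest as negatives.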
In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.
We bias the sampling towards positive windows because they are extremely rare compared to background.
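The labeling and sampling rules above can be sketched as follows. The function names, the use of class id 0 for background, matching each proposal to its best-overlapping ground-truth box, and sampling with replacement are all assumptions made for this illustration.

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, gt_classes, pos_iou=0.5):
    """Give each proposal the class of its best-overlapping ground-truth
    box if IoU >= pos_iou, otherwise class 0 (background)."""
    labels = []
    for p in proposals:
        best_iou, best_cls = 0.0, 0
        for g, c in zip(gt_boxes, gt_classes):
            overlap = iou(p, g)
            if overlap > best_iou:
                best_iou, best_cls = overlap, c
        labels.append(best_cls if best_iou >= pos_iou else 0)
    return labels

def sample_minibatch(labels, n_pos=32, n_bg=96, rng=None):
    """Return 128 proposal indices: 32 positives followed by 96 background."""
    rng = rng or random.Random(0)
    pos = [i for i, c in enumerate(labels) if c != 0]
    bg = [i for i, c in enumerate(labels) if c == 0]
    # Assumption: sample with replacement so the batch can always be
    # filled even when an image has very few positive windows.
    return rng.choices(pos, k=n_pos) + rng.choices(bg, k=n_bg)

gt_boxes, gt_classes = [(10, 10, 50, 50)], [3]
proposals = [(12, 12, 52, 52), (30, 30, 70, 70), (100, 100, 150, 150)]
labels = label_proposals(proposals, gt_boxes, gt_classes)
batch = sample_minibatch(labels)
print(labels, len(batch))  # [3, 0, 0] 128
```

Fixing the ratio at 32:96 means a quarter of every mini-batch is positive, far more than the share of positives among all proposals.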
Object category classifiers
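When a region tightly encloses an object, it is clearly a positive example; when it contains no part of the object at all, it is clearly a negative. But how should a region that only partially overlaps an object be labeled?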
We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, …, 0.5} on a validation set.
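A small sketch of how the 0.3 threshold could be applied when building SVM training labels for one class. The notes only state that regions below the threshold are negatives; treating exact ground-truth boxes as the positives and ignoring everything in between is an assumption of this sketch, as are the function names.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def svm_label(region, gt_boxes_for_class, neg_iou=0.3):
    """Return -1 (negative), +1 (positive) or 0 (ignored) for one class."""
    best = max((iou(region, g) for g in gt_boxes_for_class), default=0.0)
    if best < neg_iou:
        return -1
    # Assumption: only ground-truth boxes are positives; other regions
    # at or above the threshold are left out of SVM training.
    return 1 if region in gt_boxes_for_class else 0

gt = [(10, 10, 50, 50)]
regions = [(10, 10, 50, 50), (12, 12, 52, 52),
           (30, 30, 70, 70), (100, 100, 150, 150)]
svm_labels = [svm_label(r, gt) for r in regions]
print(svm_labels)  # [1, 0, -1, -1]
```

Note that the same partially overlapping region, `(30, 30, 70, 70)` with IoU of about 0.14, is a negative here and would also be background under the 0.5 fine-tuning rule, while `(12, 12, 52, 52)` is a fine-tuning positive but is not used as an SVM positive in this sketch.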
Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.
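The notes do not give details of the mining procedure, so the following is only a sketch of one common form of hard negative mining: train on a small set of negatives, score the full negative pool, add the negatives that violate the margin, and retrain. The tiny hinge-loss trainer stands in for a real SVM solver, and all names, hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, epochs=200, lr=0.1, reg=1e-3):
    """Full-batch subgradient descent on the regularized hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # examples inside the margin
        grad_w = reg * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
        grad_b = -y[viol].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def hard_negative_mining(pos, neg_pool, rounds=3, init=50, seed=0):
    """Grow the active negative set with margin-violating negatives."""
    rng = np.random.default_rng(seed)
    # Start from a small random subset of the negative pool.
    active = set(rng.choice(len(neg_pool), size=init, replace=False).tolist())
    for _ in range(rounds):
        idx = sorted(active)
        X = np.vstack([pos, neg_pool[idx]])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(idx))])
        w, b = train_linear_svm(X, y)
        # Hard negatives: pool examples scored above the negative margin.
        hard = set(np.flatnonzero(neg_pool @ w + b > -1).tolist())
        if hard <= active:                   # nothing new to add
            break
        active |= hard
    return w, b, sorted(active)

# Synthetic 5-dimensional features standing in for CNN features.
rng = np.random.default_rng(0)
pos = rng.normal(loc=2.0, size=(40, 5))
neg_pool = rng.normal(loc=-2.0, size=(1000, 5))
w, b, active = hard_negative_mining(pos, neg_pool)
print(w.shape, len(active))
```

Only the active set has to be held in memory for training, which is the point of the technique when the full set of negatives is too large.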
Visualization, ablation and modes of error
Visualizing learned features
We propose a simple (and complementary) non-parametric method that directly shows what the network learned.
We compute the unit’s activations on a large set of held-out region proposals, perform non-maximum suppression, and then display the top-scoring regions.
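The ranking part of this visualization can be sketched as below, assuming the activations have already been computed and stored as a regions × units array. The function name and the toy numbers are hypothetical, and the non-maximum suppression step is omitted for brevity.

```python
import numpy as np

def top_regions_for_unit(activations, unit, k=3):
    """Indices of the k regions with the highest activation for one unit,
    best first. activations has shape (n_regions, n_units)."""
    order = np.argsort(-activations[:, unit])
    return order[:k].tolist()

# Toy activations for 4 held-out regions and 2 units.
activations = np.array([[0.1, 2.0],
                        [0.7, 0.3],
                        [0.4, 1.5],
                        [0.9, 0.0]])
print(top_regions_for_unit(activations, unit=0))  # [3, 1, 2]
print(top_regions_for_unit(activations, unit=1))  # [0, 2, 1]
```

Displaying the image crops for the returned indices then shows which inputs the unit responds to most strongly.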
Bounding-box regression
After scoring each proposal with the SVM, we predict a new bounding box for the detection using a bounding-box regressor.
We regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.
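The notes do not spell out how the regression targets are parameterized. A minimal sketch, assuming the usual scale-invariant form (center offsets normalized by the proposal size, log-space width and height ratios), looks like this. The function names and the `(x1, y1, x2, y2)` box format are assumptions, and the learned regressor that maps CNN features to the deltas is not shown.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def regression_targets(proposal, gt):
    """Deltas that would move the proposal onto the ground-truth box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_deltas(proposal, deltas):
    """Apply predicted deltas to a proposal and return the new box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    cx, cy = px + pw * dx, py + ph * dy
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

proposal = (10, 10, 50, 50)
gt = (20, 15, 70, 55)
deltas = regression_targets(proposal, gt)
refined = apply_deltas(proposal, deltas)
print(deltas[:2])  # (0.375, 0.125)
```

Applying the exact targets recovers the ground-truth box, which is a quick sanity check that the two transforms are inverses of each other.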