Faster R-CNN (object detection) implemented by Keras for custom data from Google’s Open Images Dataset V4
Faster R-CNN: object detection
- common benchmarks: PASCAL VOC 2007, 2012, and MS COCO
- custom classes used here: “Person”, “Car” and “Mobile phone” (Google’s Open Images Dataset V4)
Brief explanation:
- R-CNN (selective search):
  - uses ~2,000 proposed regions (rectangular boxes) from selective search
  - the 2,000 regions are passed to a pre-trained CNN model
  - the outputs (extracted features) are passed to an SVM for classification
- Fast R-CNN:
  - passes the original image to a pre-trained CNN model
  - the selective search algorithm is computed based on the output feature map of the previous step
  - an ROI pooling layer is used to ensure a standard, pre-defined output size (valid outputs are passed to a fully connected layer as inputs)
  - two output vectors are used to:
    - predict the observed object with a softmax classifier
    - adapt bounding box localisations with a linear regressor
- Faster R-CNN:
  1. an RPN (Region Proposal Network) replaces selective search; the RPN is built on a Conv layer with 3x3 filters, padding 1 and 512 output channels
  2. similar to Fast R-CNN, ROI pooling is used for these proposed regions (ROIs)
  3. a softmax function for classification and a linear regression to refine the boxes’ locations
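A minimal Keras sketch of such an RPN head, assuming 9 anchors per feature-map position and the sigmoid objectness output this repo uses (layer names are illustrative, not the repo's exact code):

```python
from tensorflow.keras import Input, Model, layers

num_anchors = 9  # assumed: 3 scales x 3 aspect ratios per position

# Backbone feature map, e.g. the last conv output of VGG-16 (512 channels)
feature_map = Input(shape=(None, None, 512))

# Shared 3x3 conv with "same" padding (padding 1) and 512 output channels
x = layers.Conv2D(512, (3, 3), padding="same", activation="relu")(feature_map)

# Objectness score per anchor (sigmoid, matching this repo's binary setup)
rpn_cls = layers.Conv2D(num_anchors, (1, 1), activation="sigmoid")(x)

# Four box-regression values (tx, ty, tw, th) per anchor
rpn_regr = layers.Conv2D(num_anchors * 4, (1, 1), activation="linear")(x)

rpn = Model(inputs=feature_map, outputs=[rpn_cls, rpn_regr])
```

Because both heads are 1x1 convolutions, the same weights slide over every feature-map position, which is what lets the RPN score all 4,050 anchors in one forward pass.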
Code explanation:
Part 1: Extract annotation for custom classes from Google’s Open Images Dataset v4 (Bounding Boxes)
Download and load three .csv files (from Figure Eight):
1. class-descriptions-boxable.csv
2. train-annotations-bbox.csv
3. train-images-boxable.csv
- train-images-boxable.csv
  - boxable image names
  - their URL links
- class-descriptions-boxable.csv
  - class names and their corresponding class LabelName
- train-annotations-bbox.csv
  - one bounding box (bbox for short) coordinates per row
  - the bbox’s LabelName and the current image’s ID (ImageID + ’.jpg’ = Image_name)
  - (XMin, YMin) is the top-left point of this bbox, (XMax, YMax) is the bottom-right point; the coordinates are normalized to [0, 1]
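A minimal pandas sketch of loading these files and recovering image names (the file paths, and the assumption that class-descriptions-boxable.csv ships without a header row, are mine):

```python
import pandas as pd

# Assumed local paths; adjust to wherever the files were downloaded
images_boxable = pd.read_csv("train-images-boxable.csv")
annotations_bbox = pd.read_csv("train-annotations-bbox.csv")
class_descriptions = pd.read_csv("class-descriptions-boxable.csv",
                                 header=None, names=["LabelName", "ClassName"])

# Look up the LabelName code for one class, e.g. "Person"
person_label = class_descriptions.loc[
    class_descriptions["ClassName"] == "Person", "LabelName"].iloc[0]

# All bbox rows for that class; ImageID + ".jpg" gives the image file name
person_bboxes = annotations_bbox[annotations_bbox["LabelName"] == person_label]
person_image_names = [i + ".jpg" for i in person_bboxes["ImageID"].unique()]
```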
Get a subset of the whole dataset:
- 1,000 images each for ‘Person’, ‘Mobile phone’ and ‘Car’
1. Download the 3,000 images and save their annotations in a .txt file; each row contains:
- file_path: absolute file path
- (x1,y1) and (x2,y2): top-left and bottom-right pixel coordinates in the original image
- class_name: class name of the current bounding box
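A sketch of producing such a row, assuming the bbox arrives with normalized [0, 1] coordinates from train-annotations-bbox.csv (the sample path and values are placeholders):

```python
from PIL import Image

# Placeholder row: (file_path, XMin, YMin, XMax, YMax, class_name),
# coordinates normalized to [0, 1] as stored in train-annotations-bbox.csv
rows = [("/data/train/000002b66c9c498e.jpg", 0.01, 0.15, 0.75, 0.90, "Person")]

with open("annotation.txt", "w") as f:
    for file_path, xmin, ymin, xmax, ymax, class_name in rows:
        width, height = Image.open(file_path).size
        # Convert normalized coordinates to pixel coordinates of the original image
        x1, y1 = int(xmin * width), int(ymin * height)
        x2, y2 = int(xmax * width), int(ymax * height)
        f.write(f"{file_path},{x1},{y1},{x2},{y2},{class_name}\n")
```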
Training: 80%
Test: 20%
The expected numbers of training and testing images are:
3x800 -> 2,400 and 3x200 -> 600 (some images may appear in more than one class)
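A minimal split sketch, assuming image_names holds the 1,000 sampled names of one class (placeholder names below):

```python
import random

random.seed(1)
image_names = [f"img_{i}.jpg" for i in range(1000)]  # placeholder names

random.shuffle(image_names)
train_names = image_names[:800]  # 80% for training
test_names = image_names[800:]   # 20% for testing
```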
Part 2: Faster R-CNN code
Rebuild the structure of VGG-16 and load the pre-trained model (nn_base)
Prepare training data and training labels (get_anchor_gt)
- input data: the annotation.txt file
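A simplified sketch of parsing annotation.txt into per-image records, a stand-in for the parsing that feeds get_anchor_gt (field order follows the format above):

```python
from collections import defaultdict

def parse_annotations(path="annotation.txt"):
    """Group bounding boxes by image: one record per image file."""
    data = defaultdict(list)
    with open(path) as f:
        for line in f:
            file_path, x1, y1, x2, y2, class_name = line.strip().split(",")
            data[file_path].append(
                {"x1": int(x1), "y1": int(y1),
                 "x2": int(x2), "y2": int(y2),
                 "class": class_name})
    return data
```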
Calculate RPN targets for each image (calc_rpn)
- feature map shape: 18x25 = 450 positions
- anchor sizes per position: 9
- potential anchors: 450x9 = 4,050
- an anchor is set to positive if its IOU with a ground-truth box is > 0.7
- the RPN has many more negative than positive regions, so some of the negative regions are turned off
- the total number of positive and negative regions is limited to 256
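A numpy sketch of this balancing step (the random masks and variable names are placeholders; the repo's calc_rpn does the equivalent with its own bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(0)
num_regions = 256  # cap on positive + negative anchors per image

# Boolean masks over all 4,050 anchors (illustrative random labels)
positive = rng.random(4050) > 0.99
negative = rng.random(4050) > 0.50
negative &= ~positive

pos_idx = np.flatnonzero(positive)
neg_idx = np.flatnonzero(negative)

# At most half of the 256 regions may be positive
if len(pos_idx) > num_regions // 2:
    drop = rng.choice(pos_idx, len(pos_idx) - num_regions // 2, replace=False)
    positive[drop] = False
    pos_idx = np.flatnonzero(positive)

# Turn off surplus negatives so positives + negatives == 256
if len(neg_idx) + len(pos_idx) > num_regions:
    drop = rng.choice(neg_idx, len(neg_idx) + len(pos_idx) - num_regions,
                      replace=False)
    negative[drop] = False
```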
y_is_box_valid: whether this anchor is usable for training (clearly positive or negative rather than neutral)
y_rpn_overlap: whether this anchor overlaps with a ground-truth bounding box
y_rpn_cls has shape (1, 18, 25, 18):
- feature map size: 18x25
- the fourth dimension, 18, comes from 9x2:
  - 9 anchors
  - each anchor has 2 values, for y_is_box_valid and y_rpn_overlap respectively
y_rpn_regr has shape (1, 18, 25, 72):
- feature map size: 18x25
- the fourth dimension, 72, comes from 9x4x2:
  - 9 anchors, each with 4 values for tx, ty, tw and th respectively
  - each of the 4 values is paired with its own y_is_box_valid and y_rpn_overlap flags
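A numpy sketch that makes these shapes concrete (the exact channel ordering is an assumption; the repo concatenates the flags and regression targets along the last axis):

```python
import numpy as np

h, w, num_anchors = 18, 25, 9

y_is_box_valid = np.zeros((1, h, w, num_anchors))
y_rpn_overlap = np.zeros((1, h, w, num_anchors))
regr_targets = np.zeros((1, h, w, num_anchors * 4))  # tx, ty, tw, th per anchor

# Classification target: validity flags next to overlap flags -> 9x2 = 18
y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=-1)
print(y_rpn_cls.shape)   # (1, 18, 25, 18)

# Regression target: each of the 4 values carries its own flag -> 9x4x2 = 72
flags = np.repeat(y_rpn_overlap, 4, axis=-1)  # one flag per regression value
y_rpn_regr = np.concatenate([flags, regr_targets], axis=-1)
print(y_rpn_regr.shape)  # (1, 18, 25, 72)
```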
Calculate regions of interest from the RPN output (rpn_to_roi)
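Inside rpn_to_roi, the predicted deltas are applied to each anchor before non-max suppression prunes the result. A sketch of the standard (tx, ty, tw, th) parameterization (variable names are mine):

```python
import numpy as np

def apply_regr(anchor, deltas):
    """Apply (tx, ty, tw, th) deltas to an anchor given as (x, y, w, h)."""
    x, y, w, h = anchor
    tx, ty, tw, th = deltas
    cx, cy = x + w / 2.0, y + h / 2.0  # anchor centre
    cx1 = tx * w + cx                  # shifted centre
    cy1 = ty * h + cy
    w1 = np.exp(tw) * w                # rescaled width/height
    h1 = np.exp(th) * h
    return (cx1 - w1 / 2.0, cy1 - h1 / 2.0, w1, h1)

print(apply_regr((10, 10, 16, 16), (0.1, -0.2, 0.0, 0.3)))
```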
RoIPooling layer and Classifier layer (RoiPoolingConv, classifier_layer)
- RoIPooling layer: processes each ROI into a fixed-size output by max pooling
  - the input ROI is divided into sub-cells
  - max pooling is applied to each sub-cell
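A simplified numpy sketch of RoI max pooling on one feature map (the pool size of 7 and the cell-split scheme are illustrative; the repo's RoiPoolingConv is a Keras layer achieving the same effect):

```python
import numpy as np

def roi_max_pool(feature_map, roi, pool_size=7):
    """feature_map: (H, W, C); roi: (x, y, w, h) in feature-map coordinates."""
    x, y, w, h = roi
    output = np.zeros((pool_size, pool_size, feature_map.shape[2]))
    for i in range(pool_size):
        for j in range(pool_size):
            # Boundaries of sub-cell (i, j) inside the ROI
            y0 = y + int(np.floor(i * h / pool_size))
            y1 = y + int(np.ceil((i + 1) * h / pool_size))
            x0 = x + int(np.floor(j * w / pool_size))
            x1 = x + int(np.ceil((j + 1) * w / pool_size))
            output[i, j, :] = feature_map[y0:y1, x0:x1, :].max(axis=(0, 1))
    return output

pooled = roi_max_pool(np.random.rand(18, 25, 512), roi=(3, 2, 10, 8))
print(pooled.shape)  # (7, 7, 512)
```

Whatever the ROI's size, the output is always pool_size x pool_size x C, which is what lets ROIs of different shapes feed the same fully connected layers.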
Classifier layer: the final layer of the whole model, placed just behind the RoIPooling layer
- predicts the class name for each input ROI
- regresses its bounding box coordinates
First, the pooled output is flattened.
Then, it is followed by two fully connected layers, each with 0.5 dropout.
Finally, there are two output layers:
# out_class: softmax activation function for classifying the class name of the object
# out_regr: linear activation function for bbox coordinate regression
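A Keras sketch of this head for a single pooled ROI (the 4096-unit FC sizes follow VGG-16; nb_classes and the layer names are assumptions):

```python
from tensorflow.keras import Input, Model, layers

nb_classes = 4                       # assumed: 3 object classes + background
pooled = Input(shape=(7, 7, 512))    # output of the RoIPooling layer for one ROI

x = layers.Flatten()(pooled)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# out_class: softmax over object classes (including background)
out_class = layers.Dense(nb_classes, activation="softmax", name="out_class")(x)
# out_regr: 4 linear regression values per non-background class
out_regr = layers.Dense(4 * (nb_classes - 1), activation="linear",
                        name="out_regr")(x)

classifier = Model(inputs=pooled, outputs=[out_class, out_regr])
```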
Dataset
The number of bounding boxes for ‘Car’, ‘Mobile phone’ and ‘Person’ is 2383, 1108 and 3745 respectively.
Parameters
Environment
Google Colab with Tesla K80 GPU acceleration for training.
Training time
- each epoch: 1,000 iterations
- total number of epochs trained: 114
- each epoch takes around 700 seconds, for a total training time of about 22 hours
Result
There are two loss functions per model, four losses in total.
The RPN model has two outputs: cls (objectness classification) and regression.
The cls loss is already low from around epoch 20.
Reason:
- the accuracy for objectness is already high in the early stage of training
- the accuracy of the bounding boxes’ coordinates is still low and needs more time to learn
The classifier model’s losses show a similar tendency and even similar loss values, since the model is predicting quite similar targets; note that predicting objectness is easier than predicting the class name of a bbox.
The total loss is the sum of the four losses above.
mAP (mean average precision) doesn’t increase as the loss decreases:
- epoch 60: 0.15
- epoch 87: 0.19
- epoch 114: 0.13
Reason: the small number of training images leads to overfitting of the model.
Other things we could tune:
1. The image is resized to 300 on its shorter side, so the anchor_box_scales are [64, 128, 256].
2. VGG-16 has a simple structure, but ResNet-50 would be a better backbone.
3. rpn_max_overlap=0.7 and rpn_min_overlap=0.3 define the IOU range that differentiates ‘positive’, ‘neutral’ and ‘negative’ for each anchor; overlap_thresh=0.7 is the threshold for non-max suppression.
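A sketch of how these thresholds classify an anchor, using a standard IOU computation (function names are mine; the threshold names mirror the config values above):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def label_anchor(best_iou, rpn_max_overlap=0.7, rpn_min_overlap=0.3):
    """Classify an anchor by its best IOU against all ground-truth boxes."""
    if best_iou > rpn_max_overlap:
        return "positive"
    if best_iou < rpn_min_overlap:
        return "negative"
    return "neutral"  # ignored during RPN training

print(label_anchor(iou((0, 0, 10, 10), (5, 5, 15, 15))))  # 'negative'
```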