Paper Reading: Bag of Freebies for Training Object Detection Neural Networks
1. Paper Overview
This paper is the first to show how the mixup data-augmentation method should be applied to object detection. It provides several tricks for training object detection networks; these tricks add no network parameters and are applied only during training, so they incur no extra cost at inference time. Hence the name "Bag of Freebies", a term later adopted by YOLOv4.
I found a very comprehensive blog post summarizing this paper (see the reference at the end), which is useful for review. I won't repeat that post's content here; I only record the points I marked while reading the paper.
One point worth noting: although the gains are only a few points, the tricks improve the model's generalization ability. The experiments also show that one-stage detectors benefit much more from data augmentation, while two-stage detectors gain very little.
In this work, we focus on exploring effective and general approaches that can boost the performance of all popular object detection networks without introducing extra computational cost during inference. We first explore the mixup technique on object detection. Unlike [24], we recognize the special property of the multiple object detection task which favors spatial preserving transforms. Therefore we proposed a visually coherent image mixup method designed for object detection tasks. Second, we explore detailed training pipelines including learning rate scheduling, label smoothing and synchronized BatchNorm [25, 14]. Third, we investigate the effectiveness of our training tweaks by incrementally stacking them to train single and multiple stage object detection networks.
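The label smoothing tweak mentioned in the excerpt above is easy to sketch. This is the generic formulation (redistribute a small amount `eps` of probability mass uniformly over all classes); `eps=0.1` is a common default, not necessarily the paper's exact value:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: soften hard one-hot targets.

    Each target keeps (1 - eps) of its mass and the remaining eps
    is spread uniformly over the K classes, which discourages the
    network from becoming over-confident.
    """
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

# Example: 3-class one-hot target [1, 0, 0]
# becomes approximately [0.933, 0.033, 0.033] with eps = 0.1.
```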
2. Illustration of the mixup operation
The mixing weights for the two images are drawn from a beta distribution (its mathematical meaning is worth looking up in the literature). The beta-distribution parameter values used in the experiments are shown in the paper's figure.
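The paper's visually coherent mixup, unlike classification mixup, does not resize or distort either image: both are placed on a shared canvas at their original geometry, and all ground-truth boxes from both images are kept. A minimal NumPy sketch (the function name, the `alpha` default, and returning per-box loss weights are my assumptions, not the paper's exact interface):

```python
import numpy as np

def detection_mixup(img1, boxes1, img2, boxes2, alpha=1.5):
    """Visually coherent mixup for object detection (sketch).

    Images are blended on a canvas large enough for both, preserving
    spatial geometry so existing box coordinates stay valid; boxes
    from both images are concatenated.
    """
    lam = np.random.beta(alpha, alpha)  # mixing weight for img1
    h = max(img1.shape[0], img2.shape[0])
    w = max(img1.shape[1], img2.shape[1])
    mixed = np.zeros((h, w, 3), dtype=np.float32)
    # Blend both images at their original scales on the shared canvas.
    mixed[:img1.shape[0], :img1.shape[1]] += lam * img1.astype(np.float32)
    mixed[:img2.shape[0], :img2.shape[1]] += (1.0 - lam) * img2.astype(np.float32)
    # Keep all ground-truth boxes; carry lam as a per-box loss weight.
    boxes = np.concatenate([boxes1, boxes2], axis=0)
    weights = np.concatenate([np.full(len(boxes1), lam),
                              np.full(len(boxes2), 1.0 - lam)])
    return mixed, boxes, weights
```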
Effect of mixup on the network's predicted confidence:
In comparison, models trained with our mixup approach are more robust thanks to randomly generated visually deceptive training images. In addition, we also notice that the mixup model is more humble, less confident, and generates lower scores for objects on average. However, this behavior does not affect evaluation results, as shown in the experimental results.
The confidence scores become lower; the model is "more humble".
We recognize that the mixup model receives more challenges during training and is therefore significantly better than the vanilla model in handling unprecedented scenes and very crowded object groups.
Using mixup makes the network more robust in unseen scenes and in crowded scenes with occlusion.
3. The major difference between one-stage and so-called multi-stage object detection data pipelines
In terms of types of detection networks, there are two
pipelines for generating final predictions. First is single
stage detector network, where final outputs are generated
from every single cell in the feature map, for example
SSD[12] and YOLO[16] networks which generate detection results proportional to spatial shape of an input image.
The second is multi-stage proposal and sampling based approaches, following Fast-RCNN[17], where a certain number of candidates are sampled from a large pool of generated
ROIs, then the detection results are produced by repeatedly
cropping the corresponding regions on feature maps, and
the number of predictions is proportional to number of samples.
Since sampling-based approaches conduct enormous
cropping operations on feature maps, it substitutes the operation of randomly cropping input images, therefore these
networks do not require extensive geometric augmentations
applied during the training stage. This is the major difference between one-stage and so called multi-stage object detection data pipelines. In our Faster-RCNN training, we do
not use random cropping techniques during data augmentation.
4. Warmup learning rate to prevent gradient explosion early in training
Warmup learning rate is another common strategy to
avoid gradient explosion during the initial training iterations. Warmup learning rate schedule is critical to several
object detection algorithms, e.g., YOLOv3, which has a
dominant gradient from negative examples in the very beginning iterations where sigmoid classification score is initialized around 0.5 and biased towards 0 for the majority
predictions.
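A warmup schedule of this kind can be sketched as a linear ramp-up followed by cosine decay (pairing warmup with cosine decay is common practice; the exact step counts and base learning rate below are illustrative, not the paper's settings):

```python
import math

def lr_at(step, total_steps, base_lr=0.01, warmup_steps=1000):
    """Learning rate at a given training step (sketch).

    Linear warmup ramps the LR from near zero to base_lr, avoiding
    the dominant negative-example gradients of the first iterations;
    afterwards the LR follows a cosine decay to zero.
    """
    if step < warmup_steps:
        # Ramp up linearly so early gradients cannot explode.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this function would drive the optimizer's learning rate once per iteration, e.g. by assigning its value to each parameter group before the update step.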
References
Object Detection Training Tricks: "Bag of Freebies for Training Object Detection Neural Networks"