Object Detection

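Deep learning owes its boom to the many researchers who keep developing more, and more efficient, algorithms. This week's course spends several hours just to explain what a single paper, YOLO, is doing. With so many components involved, the technique starts to look like magic.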

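1. Problem Definition

Image classification: decide whether the image contains a given object.

Object Localization - Classification with localization: decide whether the object is in the image and mark its position (a rectangular region)¹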

Deep Learning Specialization 4: Convolutional Neural Networks - Week3 - Object Detection

Localization - Bounding Box ( $b_x, b_y, b_h, b_w$ )

Here $(b_x, b_y)$ is the center point of the rectangle, and $b_h$ and $b_w$ are its height and width as fractions of the image size (not absolute values).

Label for object detection: Softmax + Bounding Box
$y = \begin{bmatrix} P_c\\ b_x\\ b_y\\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}, \quad y_{\text{object}} = \begin{bmatrix} 1\\ b_x\\ b_y\\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad y_{\text{no object}} = \begin{bmatrix} 0\\ ?\\ ?\\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \end{bmatrix}$

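$P_c$ indicates whether the image contains an object, and $c_1 \dots c_4$ are the 4 classes to detect. When $P_c = 0$, none of the other entries matter (they are marked ?).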

Loss (one possible choice), where $y_i$ denotes the $i$-th component of $y$, so $y_1 = P_c$
$\mathcal{L}(\hat{y}, y) = \begin{cases} (\hat{y}_1 - y_1)^2 + \dots + (\hat{y}_n - y_n)^2 & \text{if } P_c = 1 \\ (\hat{y}_1 - y_1)^2 & \text{if } P_c = 0 \end{cases}$
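
The piecewise loss above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `localization_loss` and the example vectors are made up here, the label layout is $[P_c, b_x, b_y, b_h, b_w, c_1, \dots, c_4]$ as in the notes, and the "?" entries are stored as 0 because they never enter the loss.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for one label vector [P_c, b_x, b_y, b_h, b_w, c_1..c_4]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # Object present: every component contributes to the loss.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the P_c component is penalized, the rest are "don't care".
    return float((y_hat[0] - y[0]) ** 2)

# Hypothetical example values, chosen only for illustration.
y_object = [1, 0.5, 0.5, 0.3, 0.4, 0, 1, 0, 0]
y_no_object = [0, 0, 0, 0, 0, 0, 0, 0, 0]      # "?" entries stored as 0
prediction = [0.9, 0.5, 0.5, 0.3, 0.4, 0, 1, 0, 0]

print(localization_loss(prediction, y_object))
print(localization_loss(prediction, y_no_object))
```

With the same prediction, the loss is small against the object label (only $P_c$ is slightly off) and larger against the no-object label, where the confident $P_c$ is the only term that counts.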

2. Object Detection Algorithm (YOLO)

This part essentially just introduces how YOLO works.

2.1 Sliding windows detection

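The approach is very direct: slide a window across the image and feed each crop to the ConvNet for classification
Using several different strides (and window sizes) may improve detection quality
The computational cost is fairly high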
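
The naive procedure can be sketched as follows. This is a minimal illustration, not the course's code: `sliding_windows` and `classify` are hypothetical names, and `classify` is a stand-in for a trained ConvNet. It makes the cost visible, since the classifier runs once per window position.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (row, col, crop) for every window position that fits in the image."""
    height, width = image.shape[:2]
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            yield row, col, image[row:row + window, col:col + window]

def classify(crop):
    # Hypothetical stand-in for a ConvNet forward pass on one crop.
    return float(crop.mean())

image = np.zeros((16, 16, 3))
crops = list(sliding_windows(image, window=14, stride=2))
scores = [classify(crop) for _, _, crop in crops]   # one forward pass per window
print(len(crops))
```

A smaller stride or additional window sizes multiply the number of crops, and with it the number of forward passes.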

2.2 Convolutional Implementation of Sliding Windows

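Core algorithm: an important technique that lowers the cost of object detection and makes real-time detection possible.

Convert the fully connected layers into convolutional layers
The sliding of a convolution and the sliding of the detection window can be unified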

2.2.1 Turning FC Layer into convolutional layers

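Fully connected layer in classification: the previous layer, $5\times5\times16$, connects directly to a $400\times1$ fully connected layer.

Convolutional implementation: the previous layer, $5\times5\times16$, is convolved with 400 filters of size $5\times5$, giving a $1\times1\times400$ "fully connected layer"; then 400 filters of size $1\times1$ are applied. Note that this is computationally equivalent to the fully connected layers.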
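
The equivalence can be checked numerically. The sketch below uses random weights (an assumption made purely for illustration) and compares a fully connected layer on the flattened $5\times5\times16$ input with 400 filters of shape $5\times5\times16$ applied at the single valid position.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 16))             # activation of the previous layer
W_fc = rng.standard_normal((400, 5 * 5 * 16))   # fully connected weights, 400 units
b = rng.standard_normal(400)

# Fully connected view: flatten the input, then one matrix multiply.
fc_out = W_fc @ x.reshape(-1) + b

# Convolutional view: the same weights reshaped into 400 filters of 5x5x16.
# On a 5x5 input there is exactly one valid position, so the output is 1x1x400.
filters = W_fc.reshape(400, 5, 5, 16)
conv_out = np.einsum("khwc,hwc->k", filters, x) + b

print(np.allclose(fc_out, conv_out))
```

The difference only shows up on larger inputs: the convolutional form then produces one 400-vector per window position instead of failing on a shape mismatch.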


2.2.2 Convolution implementation of sliding windows

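With the convolutional implementation of the fully connected layers, we can go one step further to a convolutional implementation of sliding windows (a convolution is itself a sliding-window operation). Specifically:

The sliding window size plays the role of the filter size
The stride is determined by the MAX POOL size

Note in the course figure how sliding a $14 \times 14 \times 3$ window over the $16 \times 16 \times 3$ image maps the top-left window to the top-left element of the rightmost output.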
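
The output sizes can be worked out with simple arithmetic. The sketch below assumes the layer stack used in the course example as I understand it (a $5\times5$ convolution, a $2\times2$ max pool with stride 2, a $5\times5$ convolution standing in for the first fully connected layer, then $1\times1$ convolutions); these layer sizes are not stated in the notes above, and the function names are made up for illustration.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a valid (no padding) convolution or pooling layer."""
    return (size - kernel) // stride + 1

def network_output_size(size):
    size = conv_out(size, 5)        # 5x5 conv
    size = conv_out(size, 2, 2)     # 2x2 max pool, stride 2
    size = conv_out(size, 5)        # 5x5 conv acting as the first FC layer
    return size                     # 1x1 convs leave the spatial size unchanged

def window_positions(size, window=14, stride=2):
    """Number of positions per axis for an explicit sliding window."""
    return (size - window) // stride + 1

for side in (14, 16, 28):
    print(side, network_output_size(side), window_positions(side))
```

Under these assumptions the spatial size of the network output matches the number of $14 \times 14$ window positions at stride 2, which is why one forward pass over the larger image replaces many passes over individual crops.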


This solves the biggest core problem; some further work can still improve the results.

2.3 Bounding Box Predictions

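Starting from the whole-image label, the problem is now split up:

Divide the image into a grid (for example $3 \times 3$) and label each grid cell (giving 9 labels)
Both $(b_h, b_w)$ and $(b_x, b_y)$ are specified relative to the grid cell, and $(b_h, b_w)$ can be greater than 1 (the box may extend beyond the cell boundary)
Only the cell containing the object's center $(b_x, b_y)$ is marked with $P_c = 1$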
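
The labeling rules above can be sketched as a small encoder. This is an illustration under assumptions: the function name `encode_box` and its inputs (box center and size as fractions of the whole image) are hypothetical, and the "?" entries of empty cells are stored as 0.

```python
import numpy as np

def encode_box(cx, cy, w, h, class_id, grid=3, num_classes=4):
    """Build a (grid, grid, 5 + num_classes) label for one object.

    cx, cy, w, h are fractions of the whole image (an assumption of this sketch).
    """
    y = np.zeros((grid, grid, 5 + num_classes))
    col = min(int(cx * grid), grid - 1)   # cell that contains the box center
    row = min(int(cy * grid), grid - 1)
    b_x = cx * grid - col                 # center offset inside the cell, in [0, 1)
    b_y = cy * grid - row
    b_h = h * grid                        # size in units of one cell, may exceed 1
    b_w = w * grid
    y[row, col, :5] = [1, b_x, b_y, b_h, b_w]
    y[row, col, 5 + class_id] = 1
    return y

label = encode_box(cx=0.5, cy=0.75, w=0.5, h=0.25, class_id=1)
print(label[2, 1])
```

In this example the box is half the image wide, so $b_w = 1.5$ cells, while only the single cell containing the center carries $P_c = 1$.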

2.4 Anchor Boxes

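Applying different anchor boxes to the objects makes it easier to separate overlapping objects of different shapes. The label changes accordingly; for example, with two anchor boxes: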
$y = \begin{bmatrix} P_c\\ b_x\\ b_y\\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \\ c_4 \\ P_c\\ b_x\\ b_y\\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} \Rightarrow \begin{bmatrix} 0\\ ?\\ ?\\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ 1 \\ b_x\\ b_y\\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$
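
Which anchor slot an object goes into is not spelled out in these notes. A common rule, assumed here, is to pick the anchor whose shape has the highest IoU with the object's box. The sketch below compares shapes only (both boxes centered at the same point); the anchor sizes and function names are hypothetical.

```python
def shape_iou(box_wh, anchor_wh):
    """IoU of two boxes compared by shape only, as if they shared the same center."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape overlaps the box the most."""
    scores = [shape_iou(box_wh, anchor) for anchor in anchors]
    return scores.index(max(scores))

# Hypothetical anchors as (width, height): a tall one and a wide one.
anchors = [(0.3, 0.9), (0.9, 0.3)]
print(best_anchor((0.8, 0.35), anchors))   # a wide box
print(best_anchor((0.25, 0.8), anchors))   # a tall box
```

Under this rule a wide object fills the second half of the label vector, as in the example above, and a tall object in the same cell would fill the first half.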

2.5 Evaluation Metric: Intersection over Union (IoU)²
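
IoU is the area of the intersection of two boxes divided by the area of their union. A minimal sketch follows, assuming boxes are given in corner format `(x1, y1, x2, y2)` rather than the center format $(b_x, b_y, b_h, b_w)$ used in the labels; the function name is chosen here for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) corner format."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 so disjoint boxes give no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap 1, union 7
```

Identical boxes give 1, disjoint boxes give 0, and a threshold such as 0.5 is often used to decide whether a prediction counts as matching.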


2.6 Non-max suppression

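Among overlapping detections, keep only the one with the highest probability and discard the others that have a high IoU with it. The algorithm is shown below and is quite intuitive.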

Each output prediction is:
$y = \begin{bmatrix} P_c\\ b_x\\ b_y\\ b_h \\ b_w \end{bmatrix}$
Discard all boxes with $P_c \le 0.6$

While there are any remaining boxes:

Pick the box with the largest $P_c$ and output it as a prediction
Discard any remaining box with IoU $\ge 0.5$ with the box output in the previous step
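
The steps above can be sketched in plain Python. The thresholds 0.6 and 0.5 come from the description; the corner box format `(x1, y1, x2, y2)`, the function names, and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest P_c first."""
    # Discard all boxes with P_c <= score_threshold.
    remaining = [i for i, score in enumerate(scores) if score > score_threshold]
    remaining.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # box with the largest P_c
        kept.append(best)
        # Discard boxes that overlap the chosen box too much.
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 9, 9)]
scores = [0.9, 0.8, 0.7, 0.5]
kept = non_max_suppression(boxes, scores)
print(kept)
```

With several classes, this procedure is usually run once per class so that a high-scoring box of one class does not suppress a box of another.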

2.7 YOLO Algorithm

Putting the components above together gives YOLO. Specifically, with 2 anchor boxes and 4 classes on a $3 \times 3$ grid, the prediction has size $3 \times 3 \times 2 \times 9$.
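
The size follows from 5 box entries ($P_c, b_x, b_y, b_h, b_w$) plus 4 class entries per anchor. A short NumPy sketch of the bookkeeping, with variable names chosen here for illustration:

```python
import numpy as np

grid, num_anchors, num_classes = 3, 2, 4
per_anchor = 5 + num_classes                      # P_c, b_x, b_y, b_h, b_w + classes

# One prediction vector per grid cell and anchor box.
output = np.zeros((grid, grid, num_anchors, per_anchor))

# The same tensor is often stored with the anchors stacked along the last axis.
stacked = output.reshape(grid, grid, num_anchors * per_anchor)

print(output.shape, stacked.shape)
```

Both layouts hold the same 162 numbers; the stacked form matches the 18-entry label vector shown in the anchor box section.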

¹ All other images are taken from the course and will be removed on request.
² Intersection over Union (IoU) for object detection
