Advanced CNN Architectures

1. CNNs and Scene Understanding

In a typical classification task, there's usually a single object per image that a network is expected to classify. But in the real world, we're often faced with much more complex visual scenes, with many overlapping objects. We can see and classify many objects at a time, and even estimate things like the distance between objects in a scene.

In this lesson, we will look at different kinds of CNN architectures and see how they've evolved over time.

Specifically, we'll look at models that detect multiple objects in a scene, like Faster R-CNN and YOLO. These are two kinds of networks that can look at an image, break it up into smaller regions, and label each region with a class, so that a variable number of objects in a given image can be localized and labeled.

Later in the course, you will also learn about recurrent neural networks, which allow us to process and generate sequences of data, such as a sequence of image frames or a sequence of words. This is useful if you want to describe visual scenes, as in automatic image captioning.

So, let's start by looking at some complex tasks that CNNs can be applied to.

2. Beyond Classification

First, we'll talk about classifying one object in an image and localizing it. This is typically done by placing a bounding box around the object.

Finding the location of an object in an image and, more generally, being able to analyze an image by breaking it up into smaller bounded regions is the key to creating a model that can classify multiple objects in an image.

We'll build up to learning about **region-based CNNs** like the Faster R-CNN model, which analyzes different cropped areas of a single input image, decides which regions correspond to objects, and then performs classification as usual.

3. Classification and Localization

To find the bounding box described above, we can use a lot of the same structure as in a typical classification CNN.

One way to perform localization is to first put a given image through a series of convolutional and pooling layers to create a feature vector for that image.

You keep the same fully connected layers to perform classification and add another fully connected layer, attached to the feature vector, whose job is to predict the location and size of a bounding box.

In this case, we assume that the input image not only has an associated true label, but also a true bounding box. This way, we can train our network by comparing the predicted and true values for both the classes and the bounding boxes.
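To make the two-output idea concrete, here is a minimal sketch in plain Python of one shared feature vector feeding two fully connected heads. The feature values, weights, and the helper names `linear` and `softmax` are made up for illustration; a real model would learn the weights and would usually be written with a deep learning library.

```python
import math

def linear(weights, bias, features):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical feature vector produced by the conv/pool layers.
features = [0.5, -1.0, 2.0]

# Classification head: 2 classes (weights invented for this sketch).
class_w = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
class_b = [0.0, 0.0]

# Localization head: 4 outputs read as (x, y, w, h).
box_w = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.1], [0.2, 0.2, 0.2], [0.4, 0.0, 0.1]]
box_b = [0.0, 0.5, 0.0, 0.0]

class_probs = softmax(linear(class_w, class_b, features))
box = linear(box_w, box_b, features)

print(class_probs)  # one probability per class
print(box)          # four regression outputs for the bounding box
```

Both heads read the same features, so a single forward pass yields a class prediction and a box prediction that can each be compared with its true value.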

We already know how to use something like cross-entropy loss to measure the performance of a classification model. Cross-entropy operates on probabilities with values between 0 and 1.

**But for the bounding box, we need something different: a function that measures the error between our predicted bounding box and a true bounding box.** Next, you'll see what kinds of loss functions are appropriate for a regression problem like this, which compares quantities rather than class scores.

4. Bounding Boxes and Regression

For classification, we typically use cross-entropy to measure the error because cross-entropy loss decreases as the predicted class, which has some uncertainty associated with it, gets closer and closer to the true class label.

But when we compare a set of points, say locations, points on a face, or points that define a specific region in an image, we need a loss function that measures the similarity between these coordinate values. **This is not a classification problem; it is a regression problem.**

Unlike a class label, a predicted point is not simply right or wrong. We can only evaluate quantities by looking at something like the mean squared error (MSE) between them.

1. L1 loss: the error is penalized in proportion to its absolute value, so you may find that L1 loss becomes negligible for small error values.

2. MSE loss: MSE responds the most to large errors, so it may end up amplifying errors that are big but infrequent, also known as outliers.

3. Smooth L1 loss: tries to combine the best aspects of MSE and L1. It behaves like MSE for small errors and like L1 for large ones.
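The three losses above can be compared with a small plain-Python sketch. The function names, the `beta` threshold (set to 1 here, which is an assumption about where Smooth L1 switches from quadratic to linear), and the example boxes are all illustrative.

```python
def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for errors below beta, linear above it, then averaged."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff ** 2 / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

# Hypothetical (x, y, w, h) boxes: three values are close, one is an outlier.
true_box = [10.0, 20.0, 50.0, 40.0]
pred_box = [10.5, 20.0, 50.5, 48.0]

print(l1_loss(pred_box, true_box))         # 2.25
print(mse_loss(pred_box, true_box))        # 16.125
print(smooth_l1_loss(pred_box, true_box))  # 1.9375
```

The single outlier (an error of 8) dominates the MSE value, while L1 and Smooth L1 grow only linearly with it.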

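Beyond Bounding Boxes

To predict bounding boxes, we train a model to take an image as input and output coordinate values: (x, y, w, h). This kind of model can be extended to any problem that has coordinate values as outputs! One such example is human pose estimation.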

Human pose estimation points.

In the example above, we see that a human pose can be tracked using 14 points on the joints of a body.

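Weighted Loss Functions

You may be wondering: how can we train a network with two different outputs (a class and a bounding box) and a different loss for each of those outputs?

We know that, in this case, we use categorical cross-entropy to calculate the loss for the predicted and true classes, and a regression loss (something like Smooth L1 loss) to compare the predicted and true bounding boxes. But we have to train the whole network using one loss, so can these losses be combined?

There are a few ways to train a network on multiple loss functions. In practice, we often use a weighted sum of the classification and regression losses (for example, 0.5*cross_entropy_loss + 0.5*L1_loss); the result is a single error value that we can backpropagate. This does introduce a hyperparameter: the loss weights. We want to weight each loss so that the losses are balanced and combined effectively, and in research another regularization term is often introduced to help decide on the weights that best combine the losses.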
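As a rough sketch of the weighted sum, the following plain-Python example combines a cross-entropy term with a Smooth L1 term. The function names, the default weights `w_class` and `w_box`, and the example prediction are assumptions for illustration, not a prescribed recipe.

```python
import math

def cross_entropy(class_probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(class_probs[true_index])

def smooth_l1(pred, target):
    """Mean Smooth L1 loss with the transition at an error of 1."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff ** 2 if diff < 1 else diff - 0.5
    return total / len(pred)

def combined_loss(class_probs, true_index, pred_box, true_box,
                  w_class=0.5, w_box=0.5):
    """Weighted sum of the classification and regression losses."""
    return (w_class * cross_entropy(class_probs, true_index)
            + w_box * smooth_l1(pred_box, true_box))

# Hypothetical prediction for one image: 3 class probabilities and one box.
probs = [0.7, 0.2, 0.1]
pred_box = [2.5, 3.0, 4.0, 4.0]
true_box = [2.0, 5.0, 4.0, 4.0]

loss = combined_loss(probs, 0, pred_box, true_box)
print(loss)  # a single scalar that could be backpropagated
```

Changing `w_class` and `w_box` shifts how much each error type contributes to the single value, which is exactly the hyperparameter discussed above.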

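5. Quiz: Loss Values

Refer to the MSE loss documentation. For a true coordinate (2, 5) and a predicted coordinate (2.5, 3), what is the MSE loss between these points? (You may assume the default of averaging.)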

2.125

Question 2 of 2

Refer to the Smooth L1 loss documentation. For a true coordinate (2, 5) and a predicted coordinate (2.5, 3), what is the Smooth L1 loss between these points?

0.8125
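Both quiz answers can be checked by hand. The working below assumes the default mean reduction and, for Smooth L1, a threshold of 1 between the quadratic and linear regimes.

```latex
\begin{align*}
\text{errors: } & |2.5 - 2| = 0.5, \qquad |3 - 5| = 2 \\
\text{MSE} &= \frac{0.5^2 + 2^2}{2} = \frac{0.25 + 4}{2} = 2.125 \\
\text{Smooth L1} &= \frac{\tfrac{1}{2}(0.5)^2 + \left(2 - \tfrac{1}{2}\right)}{2}
                  = \frac{0.125 + 1.5}{2} = 0.8125
\end{align*}
```

The first error is below 1, so it takes the quadratic branch; the second is above 1, so it takes the linear branch.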

6. Region Proposals

Now you've seen how to locate one object in an image by generating a bounding box around that object. But what if there are multiple objects in an image? How can you train a network to detect all of them?

One approach is to simplify the input image by splitting it into separate regions, say two, each of which contains only one object. Then we can proceed in the same way as before.

The real challenge here is that the output is variable. You don't know ahead of time how many objects are going to be in a given image, and CNNs, like most neural networks, have a defined, unchanging output size.

So to detect a variable number of objects in any image, you first must break that image up into smaller regions and produce bounding boxes and class labels for one region and one object at a time. We'll learn about techniques for finding these regions shortly.

So how can we go about breaking up an image into regions? We know that we want these regions to correspond to different objects in the image, and we don't want to miss any objects.

We could just make a bunch of cropped regions to make sure we don't miss anything. This would mean defining a small sliding window and passing it over the entire image, using some value for stride, to create many different crops of the original input image.

Then, for each cropped region, we can put it through a CNN and perform classification.
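To get a feel for how many crops a sliding window produces, here is a small sketch. The function name `sliding_windows` and the image, window, and stride sizes are illustrative assumptions.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Return (x, y, w, h) crops, scanning left to right, top to bottom."""
    crops = []
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            crops.append((x, y, window, window))
    return crops

# Hypothetical 224x224 image, 64x64 window, stride of 16 pixels.
crops = sliding_windows(image_w=224, image_h=224, window=64, stride=16)
print(len(crops))  # prints 121
```

That is 121 CNN passes for a single window size; covering objects of different sizes would require repeating the scan with several window sizes.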

However, this approach produces a huge number of cropped images and is extremely time-intensive. Also, in this case, most of the cropped images don't even contain objects. So how can you better choose these cropped regions, especially when objects vary in size and location?

Next, I want you to think about how you might improve this region selection process. You want to make sure not to miss any objects, but you also don't want to put a huge number of cropped regions through a CNN.

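For the image above, how do you think the best proposed regions should be selected? What criteria does a good region have to meet?

Things to Think About

The regions we want to analyze are those that contain complete objects. We want to get rid of regions that contain image background or only a portion of an object. So two common approaches are suggested:
1. Identify similar regions using feature extraction or a clustering algorithm such as k-means, as you've already seen; these methods should identify any areas of interest.
2. Add another layer to the model that performs a binary classification on these regions and labels them as object or not-object; this lets us discard any non-object regions!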

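7. R-CNN (Region-based CNN)

R-CNN Outputs

R-CNN is the least sophisticated region-based architecture, but it is the basis for understanding how multiple-object recognition algorithms work! It outputs a class score and bounding box coordinates for every input RoI.

An R-CNN feeds an image into a CNN with regions of interest (RoIs) already identified. Since these RoIs are of varying sizes, they often need to be warped to a standard size, because CNNs typically expect a consistent, square image size as input. After the RoIs are warped, the R-CNN architecture processes these regions one by one and, for each one, produces: 1. a class label, and 2. a bounding box (which may act as a slight correction to the input region).

R-CNN produces bounding box coordinates to reduce localization errors: a region comes in, but it may not perfectly surround a given object, and the output coordinates (x, y, w, h) aim to perfectly localize the object in that region.

Unlike other models, R-CNN does not explicitly produce a confidence score that indicates whether an object is in a region. Instead, it produces a set of class scores in which one class is "background". This ends up serving a similar purpose: for example, if the class score for a region is Pbackground = 0.10, the region likely contains an object, but if Pbackground = 0.90, the region probably doesn't contain an object.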
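The role of the background score can be sketched as follows. The class list, the raw scores, and the 0.5 cut-off in `classify_region` are hypothetical; they only illustrate how a background probability can stand in for an explicit objectness score.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["background", "cat", "dog"]  # hypothetical label set

def classify_region(scores, background_threshold=0.5):
    """Return (label, probability), or None if the region looks like background."""
    probs = softmax(scores)
    if probs[0] >= background_threshold:
        return None  # treated as a non-object region and discarded
    best = max(range(1, len(probs)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]

print(classify_region([3.0, 0.0, 0.0]))  # high background score: None
print(classify_region([0.0, 3.0, 0.5]))  # low background score: a 'cat' region
```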

To meet this goal of generating a good, limited set of cropped regions, the idea of **region proposals** was introduced.

Region proposals give us a way to quickly look at an image and generate regions only for areas in which we think there may be an object.

We can use traditional computer vision techniques that detect things like edges and textured blobs to produce a set of regions in which objects are most likely to be found, such as areas of similar texture.

These proposals often produce noisy, non-object regions, but they are also very likely to include the regions in which objects are located. So the noise is considered a worthwhile cost for not missing any objects.

So let's see how this looks when incorporated into a CNN architecture. We can use a region proposal algorithm to produce a limited set of cropped regions, often called regions of interest or RoIs, and then put these regions through a classification CNN.

In this case, we also include a class called background, which is meant to capture any noisy regions.

Now, the main shortcoming of this method is that it is still time-intensive, because it requires each cropped region to go through an entire CNN before a class label can be produced.

8. Fast R-CNN

The next advancement in region-based CNNs came with the Fast R-CNN architecture. Instead of processing each region of interest individually through a classification CNN, this architecture runs the entire image through a classification CNN only once.

We still need to identify regions of interest, but instead of cropping the original image, we project these proposals onto the smaller feature map layer. Each region in the feature map corresponds to a larger region in the original image. So we can grab selected regions in the feature map and feed them one by one into a fully connected layer that generates a class for each of these different regions.
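Projecting a proposal onto the feature map amounts to dividing its image coordinates by the network's total downsampling factor. The sketch below assumes a factor of 16 and rounds outward so the projected region still covers the whole proposal; both choices, and the name `project_roi`, are illustrative.

```python
import math

def project_roi(roi, stride=16):
    """Map an (x1, y1, x2, y2) box in image pixels to feature-map cells."""
    x1, y1, x2, y2 = roi
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))

# Hypothetical proposal in a 224x224 image.
print(project_roi((64, 32, 200, 180)))  # prints (4, 2, 13, 12)
```

A 136x148 pixel proposal becomes a 9x10 patch of feature-map cells, which is far cheaper to process than a fresh CNN pass over the crop.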

Again, we have to handle the variable sizes of these projections, since layers further in the network expect input of a fixed size. So we do something called RoI pooling to warp these regions into a consistent size before giving them to a fully connected layer.

Now, this network is faster than R-CNN, but it's still slow when faced with a test image for which it has to generate region proposals, and it's still looking at regions that do not contain objects at all.

The next architecture aims to improve this region generation step.

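RoI Pooling

To warp regions of interest into a consistent size for further analysis, some networks use RoI pooling. RoI pooling is an additional layer in the network that takes in a rectangular region of any size, splits it into sections, and performs a max-pooling operation on each section so that the output has a fixed shape. Below is an example of a region whose pixel values are split up into sections and pooled; a section with the following values: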

[[0.85, 0.34, 0.76],
 [0.32, 0.74, 0.21]]

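will become a single max value after pooling: 0.85. After splitting up an image in this way, you can see how a rectangular region can be reduced to a smaller, square representation.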
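The pooling step described above can be sketched in plain Python. This version assumes the region is at least as large as the output grid and splits rows and columns into bins with integer division; real implementations work on feature-map tensors and may divide the bins differently. The 4x6 region is made up for the example.

```python
def roi_max_pool(region, out_h, out_w):
    """Max-pool a 2D list of values into an out_h x out_w grid."""
    in_h, in_w = len(region), len(region[0])
    pooled = []
    for i in range(out_h):
        # Rows covered by this bin, split as evenly as integer division allows.
        r0, r1 = i * in_h // out_h, (i + 1) * in_h // out_h
        row = []
        for j in range(out_w):
            c0, c1 = j * in_w // out_w, (j + 1) * in_w // out_w
            row.append(max(region[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# The section from the text collapses to its single largest value.
section = [[0.85, 0.34, 0.76],
           [0.32, 0.74, 0.21]]
print(roi_max_pool(section, 1, 1))  # prints [[0.85]]

# A hypothetical 4x6 region reduced to a fixed 2x2 output.
region = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
          [0.7, 0.8, 0.9, 0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
          [0.3, 0.2, 0.1, 0.6, 0.5, 0.4]]
print(roi_max_pool(region, 2, 2))   # prints [[0.9, 0.6], [0.6, 0.9]]
```

Whatever the input size, the output shape is fixed by `out_h` and `out_w`, which is what the following fully connected layer needs.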

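Example of a pooled region, from this resource on RoI pooling by Tomasz Grel.

Below, you can see the complete process of going from an input image to reduced, max-pooled regions.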

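Image from this resource on RoI pooling.

Speed

Fast R-CNN is about 10 times as fast to train as R-CNN, because it computes the convolutional layers only once for a given image and then performs further analysis on the resulting feature map. Fast R-CNN also takes less time to test on a new image! Its test time is dominated by the time it takes to create region proposals.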

9. Faster R-CNN

We want to decrease the time it takes to form region proposals. Faster R-CNN learns to come up with its own region proposals.

As in Fast R-CNN, the image is passed through a CNN to produce a feature map, but this time that feature map is used as input to a separate region proposal network. So it predicts its own regions from the features produced inside the network. If an area in the feature map is rich in detected edges or other features, it's identified as a region of interest. Then this part of the network does a quick binary classification: for each RoI, it checks whether or not that region contains an object.

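Region Proposal Network

You may be wondering: how exactly are the RoIs generated in the region proposal portion of the Faster R-CNN architecture?

The region proposal network (RPN) in Faster R-CNN works in a similar way to YOLO object detection, which you'll learn about in the next lesson. The RPN looks at the output of the last convolutional layer, a feature map, and takes a sliding window approach to detecting possible objects. It slides a small (typically 3x3) window over the feature map, and then, for each window, the RPN:

1. Uses a set of defined anchor boxes, which are boxes of a fixed aspect ratio (wide and short or tall and thin, for example), to generate multiple possible RoIs, each of which is considered a region proposal.
2. For each proposal, produces a probability, Pc, that classifies the region as an object (or not-object) region, along with a set of bounding box coordinates for that object.
3. Discards regions with too low a probability of being an object (say, Pc < 0.5).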
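The anchor-box and filtering steps can be sketched as below. The base size of 128, the three aspect ratios, and the Pc values are assumptions chosen for illustration; in a real RPN the probabilities come from the network and anchors are usually generated at several scales as well.

```python
def make_anchors(cx, cy, base_size=128, ratios=(0.5, 1.0, 2.0)):
    """Return (x, y, w, h) anchors centred on (cx, cy), one per aspect ratio.

    ratio is height / width; every anchor keeps roughly the same area.
    """
    anchors = []
    for ratio in ratios:
        w = base_size / ratio ** 0.5
        h = base_size * ratio ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, w, h))
    return anchors

def keep_confident(proposals, threshold=0.5):
    """Drop proposals whose object probability Pc is below the threshold."""
    return [box for box, pc in proposals if pc >= threshold]

anchors = make_anchors(cx=100, cy=100)
scores = [0.9, 0.3, 0.6]  # hypothetical Pc values, one per anchor
kept = keep_confident(zip(anchors, scores))

print(len(anchors))  # prints 3: wide, square, and tall anchors
print(len(kept))     # prints 2: the anchor with Pc = 0.3 is discarded
```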

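Training the Region Proposal Network

Since, in this case, there are no ground truth regions, how do you train the region proposal network?

The idea is that, for any region, you can check whether it overlaps with any of the ground truth objects. That is, for a region, if we classify it as an object or not-object region, which class will it fall into? For a region proposal that covers some portion of an object, we should say that there is a high probability that the region has an object in it, and that region should be kept; if the likelihood of an object being in a region is too low, that region should be discarded.

If you'd like to learn more about region selection, I recommend reading this blog post.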
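One common way to measure the overlap between a proposal and a true box is intersection over union (IoU). The text above does not name a specific measure, so treat IoU, the 0.5 threshold, and the helper names below as assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, true_boxes, threshold=0.5):
    """Label a proposal 1 (object) if it overlaps any true box enough, else 0."""
    best = max((iou(proposal, t) for t in true_boxes), default=0.0)
    return 1 if best >= threshold else 0

truth = [(10, 10, 50, 50)]                     # hypothetical ground truth box
print(label_proposal((12, 12, 52, 52), truth))  # prints 1: large overlap
print(label_proposal((60, 60, 90, 90), truth))  # prints 0: no overlap
```

Labels produced this way give the binary object/not-object targets that the region proposal network can be trained against.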

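Speed Bottleneck

Now, for all of these networks, including Faster R-CNN, we've aimed to improve the speed of our object detection models by reducing the time it takes to generate and decide on region proposals. You might be wondering: is there a way to get rid of this proposal step entirely? In the next section, we'll see a method that does not rely on region proposals!

Quiz

Why does speed matter?

Multi-object detection involves localizing and classifying different objects in an image. Some models trade speed for accuracy. Which applications need object detection to be fast? Check all that apply.

Accurately classifying images of cancerous and non-cancerous tissue.
Pedestrian detection for a self-driving car.
Recognizing faces in a set of images.
Tracking a face so that a camera can keep it in focus.

Faster R-CNN Implementation

If you'd like to look at an implementation of this network in code, you can find a peer-reviewed version in this GitHub repository.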
