《Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition》翻译
Abstract—Existing deep convolutional neuralnetworks (CNNs) require a fixed-size (e.g., 224×224) input image. Thisrequirement is “artificial” and may reduce the recognition accuracy for theimages or sub-images of an arbitrary size/scale. In this work, we equipthe networks with another pooling strategy, “spatial pyramid pooling”, toeliminate the above requirement. The new network structure, calledSPP-net, can generate a fixed-length representation regardless of imagesize/scale. Pyramid pooling is also robust to object deformations. Withthese advantages, SPP-net should in general improve all CNN-basedimage classification methods. On the ImageNet 2012 dataset, we demonstratethat SPP-net boosts the accuracy of a variety of CNN architectures despitetheir different designs. On the Pascal VOC 2007 and Caltech101 datasets,SPP-net achieves state-of-the-art classification results using a singlefull-image representation and no fine-tuning.
The power of SPP-net is also significantin object detection. Using SPP-net, we compute the feature maps from theentire image only once, and then pool features in arbitrary regions(sub-images) to generate fixed-length representations for training thedetectors. This method avoids repeatedly computing the convolutional features.In processing test images, our method is 24-102×faster than the R-CNN method,while achieving better or comparable accuracy on Pascal VOC 2007.
In ImageNet Large Scale VisualRecognition Challenge (ILSVRC) 2014, our methods rank #2 in object detectionand #3 in image classification among all 38 teams. This manuscript alsointroduces the improvement made for this competition.
Index Terms—Convolutional NeuralNetworks, Spatial Pyramid Pooling, Image Classification, Object Detection
摘要
在ImageNet大规模视觉识别任务挑战(ILSVRC)2014上,我们的方法在物体检测上排名第2,在物体分类上排名第3,参赛的总共有38个组。本文也介绍了为了这个比赛所作的一些改进。
关键词:Convolutional NeuralNetworks, Spatial Pyramid Pooling, Image Classification, Object Detection
1 INTRODUCTION
We are witnessing a rapid, revolutionary change in ourvision community, mainly caused by deep convolutional neural networks (CNNs)[1] and the availability of large scale training data [2]. Deep-networks-based approacheshave recently been substantially improving upon the state of the art in imageclassification [3], [4], [5], [6], object detection [7], [8], [5],many other recognition tasks [9], [10], [11], [12],and even non-recognition tasks.
1.引言
我们看到计算机视觉领域正在经历飞速的变化,这一切得益于深度卷积神经网络(CNNs)[1]和大规模的训练数据的出现[2]。近来深度网络对图像分类 [3][4][5][6],物体检测 [7][8][5]和其他识别任务 [9][10][11][12],甚至很多非识别类任务上都表现出了明显的性能提升。
However, there is a technical issue in the training andtesting of the CNNs: the prevalent CNNs require a fixed input image size (e.g.,224×224), which limits both the aspect ratio and the scale of the input image.Whenapplied to images of arbitrary sizes, current methods mostly fit the inputimage to the fixed size, either via cropping [3], [4] or via warping [13], [7],asshown in Figure 1 (top). But the cropped region may not contain the entireobject, while the warped content may result in unwanted geometric distortion.Recognitionaccuracy can be compromised due to the content loss or distortion. Besides, apre-defined scale may not be suitable when object scales vary. Fixing inputsizes overlooks the issues involving scales.
然而,这些技术再训练和测试时都有一个问题,这些流行的CNNs都需要输入的图像尺寸是固定的(比如224×224),这限制了输入图像的长宽比和缩放尺度。当遇到任意尺寸的图像时,都是先将图像适应成固定尺寸,方法包括裁剪[3][4]和变形[13][7],如图1(上)所示。但裁剪会导致信息的丢失,变形会导致位置信息的扭曲,就会影响识别的精度。另外,一个预先定义好的尺寸在物体是缩放可变的时候就不适用了。固定输入大小忽略了涉及比例的问题。
So why do CNNs require a fixed input size? A CNN mainly consists of two parts: convolutional layers,and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps which represent the spatial arrangement of the activations (Figure2). In fact, convolutional layers do not require a fixed image size and cangenerate feature maps of any sizes. On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, thefixed-size constraint comes only from the fully-connected layers, which existat a deeper stage of the network.
那么为什么CNNs需要一个固定的输入尺寸呢?CNN主要由两部分组成,卷积层和其后的全连接层。卷积部分通过滑窗进行计算,并输出代表**的空间排布的特征图(feature map)(图2)。事实上,卷积并不需要固定的图像尺寸,他可以产生任意尺寸的特征图。而另一方面,根据定义,全连接层则需要固定的尺寸输入。因此固定尺寸的问题来源于全连接层,也是网络的最后阶段。
In this paper, we introduce a spatial pyramid pooling(SPP)[14], [15] layer to remove the fixed-size constraint of the network.Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words,we perform some information “aggregation” at a deeper stage of the networkhierarchy (between convolutional layers and fully-connected layers) to avoid theneed for cropping or warping at the beginning.Figure 1 (bottom) shows thechange of the network architecture by introducing the SPP layer. We call the newnetwork structure SPP-net.
本文引入一种空间金字塔池化( spatial pyramid pooling,SPP)层以移除对网络固定尺寸的限制。尤其是,将SPP层放在最后一个卷积层之后。SPP层对特征进行池化,并产生固定长度的输出,这个输出再喂给全连接层(或其他分类器)。换句话说,在网络层次的较后阶段(也就是卷积层和全连接层之间)进行某种信息“汇总”,可以避免在最开始的时候就进行裁剪或变形。图1(下)展示了引入SPP层之后的网络结构变化。我们称这种新型的网络结构为SPP-net。
Spatial pyramid pooling [14], [15] (popularly known as spatial pyramid matching or SPM [15]), as an extension of the Bag-of-Words (BoW) model [16],is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20])before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs.We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Throughexperiments we show that all these factors elevate the recognition accuracy of deep networks.
空间金字塔池化[14][15](普遍称谓:空间金字塔匹配spatial pyramid matching, SPM[15]),是一种词袋(Bag-of-Words, BoW)模型的扩展。词袋模型是计算机视觉领域最成功的方法之一。它将图像切分成粗糙到精细各种级别,然后整合其中的局部特征。在CNN之前,SPP一直是各大分类比赛[17][18][19]和检测比赛(比如[20])的冠军系统中的核心组件。对深度CNNs而言,SPP有几个突出的优点:1)SPP能在输入尺寸任意的情况下产生固定大小的输出,而以前的深度网络[3]中的滑窗池化(sliding window pooling)则不能;2)SPP使用了多级别的空间箱(bin),而滑窗池化则只用了一个窗口尺寸。多级池化对于物体的变形十分鲁棒[15];3)由于其对输入的灵活性,SPP可以池化从各种尺度抽取出来的特征。通过实验,我们将展示影响深度网络最终识别精度的所有这些因素。
SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feedimages with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable inputsizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training,and leads to better testing accuracy.
SPP-net不仅仅让测试阶段允许任意尺寸的输入能够产生表示(representations),也允许训练阶段的图像可以有各种尺寸和缩放尺度。使用各种尺寸的图像进行训练可以提高缩放不变性,以及减少过拟合。我们开发了一个简单的多尺度训练方法。为了实现一个单一的能够接受各种输入尺寸的网络,我们先使用分别训练固定输入尺寸的多个网络,这些网络之间共享权重(Parameters),然后再一起来代表这个单一网络(译者注:具体代表方式没有说清楚,看后面怎么说吧)。每个epoch,我们针对一个给定的输入尺寸进行网络训练,然后在下一个epoch再切换到另一个尺寸。实验表明,这种多尺度训练和传统的单一尺度训练一样可以瘦脸,并且能达到更好的测试精度。The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications[3], [4], [5] (or their modifications), over the no-SPP counterparts. These architectures have various filter numbers/sizes, strides, depths, or other designs.It is thus reasonable for us to conjecture that SPP should improve more sophisticated(deeper and larger) convolutional architectures. SPP-net also shows state-of-the-art classification results on Caltech101 [21] and Pascal VOC 2007[22] using only a single full-image representation and no fine-tuning.
SPP的优点是与各类CNN设计是正交的。通过在ImageNet2012数据集上进行一系列可控的实验,我们发现SPP对[3][4][5]这些不同的CNN架构都有提升。这些架构有不同的特征数量、尺寸、滑动距离(strides)、深度或其他的设计。所以我们有理由推测SPP可以帮助提升更多复杂的(更大、更深)的卷积架构。SPP-net也做到了 Caltech101 [21]和Pascal VOC 2007 [22]上的最好结果,而只使用了一个全图像表示,且没有调优。
SPP-net also shows great strength in object detection. In the leading object detection method R-CNN[7],the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because itrepeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can runthe convolutional layers only once on the entireimage (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speed up of over one hundred times over R-CNN. Note that training/running a detector on the feature maps(rather than image regions) is actually a more popular idea [23], [24], [20],[5]. But SPP-net inherits the power of the deep CNN feature maps and also the flexibility of SPP on arbitrary window sizes, which leads to outstanding accuracy and efficiency. In our experiment, the SPP-net-based system (built upon the R-CNNpipeline) computes features 24-102×faster than R-CNN, while has better or comparable accuracy.With the recent fast proposal method of EdgeBoxes[25], oursystem takes 0.5 seconds processing an image(including all steps). This makes our method practical for real-world applications.
在图像检测方面,SPP-net也表现优异。目前领先的方法是R-CNN[7],候选窗口的特征是借助深度神经网络进行抽取的。此方法在VOC和ImageNet数据集上都表现出了出色的检测精度。但R-CNN的特征计算十分耗时,因为他对每张图片中的上千个变形后的区域的像素反复调用CNN。本文中,我们展示了我们只需要在整张图片上运行一次卷积网络层(不关心窗口的数量),然后再使用SPP-net在特征图上抽取特征。这个方法缩减了上百倍的耗时。在特征图(而不是图像区域)上训练和运行检测器是一个很受欢迎的想法[23][24][20][5]。但SPP-net延续了深度CNN特征图的优势,也结合了SPP兼容任意窗口大小的灵活性,所以做到了出色的精度和效率。我们的实验中,基于SPP-net的系统(建立在R-CNN流水线上)比R-CNN计算特征要快24-120倍,而精度却更高。结合最新的推荐方法EdgeBoxes[25],我们的系统达到了每张图片处理0.5s的速度(全部步骤)。这使得我们的方法变得更加实用。
A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we attended the competition of ILSVRC 2014[26], and ranked #2 in object detection and #3 in image classification (bothare provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014. We show that the SPP-nets can boost various networks that are deeper and larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts.Further, driven by our detection framework, we find that multi-view testing onfeature maps with flexibly located/sized windows (Sec. 3.1.5) can increase the classification accuracy. This manuscript also provides the details of these modifications.
We have released the code to facilitate future research(http://research.microsoft.com/ en-us/ um/ people/ kahe/).
本论文的一个早先版本发布在ECCV2014上。基于这个工作,我们参加了ILSVRC 2014 [26],在38个团队中,取得了物体检测第2名和图像分类第3名的成绩。针对ILSVRC 2014我们也做了很多修改。我们将展示SPP-nets可以将更深、更大的网络的性能显著提升。进一步,受检测框架驱动,我们发现借助灵活尺寸窗口对特征图进行多视角测试可以显著提高分类精度。本文对这些改动做了更加详细的说明。
另外,我们将代码放在了以方便大家研究(http://research.microsoft.com/en-us/um/people/kahe/,译者注:已失效)
2 DEEP NETWORKS WITH SPATIAL PYRA-MID POOLING
2.1 Convolutional Layers and Feature Maps Consider thepopular seven-layer architectures [3], [4].The first five layers are convolutional, some of which are followed by pooling layers. These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows. The last two layers are fully connected, with an N-way softmax as the output, where N is the number of categories.
2. 基于空间金字塔池化的深度网络
2.1 卷积层和特征图
在颇受欢迎的七层架构中[3][4]中,前五层是卷积层,其中一些后面跟着池化层。从他们也使用滑窗的角度来看,这些池化层也可以认为是“卷积的”。最后两层是全连接的,跟着一个N路softmax输出,其中N是类别的数量。
The deep network described above needs a fixed imagesize. However, we notice that the requirement of fixed sizes is only due to the fully-connected layers that demand fixed-length vectors as inputs. On the other hand, the convolutional layers accept inputs of arbitrary sizes. The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs. These outputs are known as feature maps [1] -they involve not only the strength of the responses, but also their spatial positions.
上述的深度网络需要一个固定大小的图像尺寸。然而,我们注意到,固定尺寸的要求仅仅是因为全连接层的存在导致的。另一方面,卷积层使用滑动的特征过滤器,它们的输出基本保持了原始输入的比例关系。它们的输出就是特征图[1]-它们不仅涉及响应的强度,还包括空间位置。In Figure 2, we visualize some feature maps. They are generated by some filters of the conv layer. Figure 2(c) shows the strongest activated images of these filters in the ImageNet dataset. We see a filter can be activated by some semantic content. For example, the 55-th filter (Figure 2,bottom left) is most activated by a circle shape; the 66-th filter (Figure 2,top right) is most activated by a ∧-shape; and the 118-th filter (Figure 2, bottomright) is most activated by a ∨-shape.These shapes in the input images (Figure 2(a))activate the feature maps at the corresponding positions (the arrows in Figure2).
图2中,我们可视化了一些特征图。这些特征图来自于conv5层的一些过滤器。图2(c)显示了ImageNet数据集中**最强的若干图像。可以看到一个过滤器能够被一些语义内容**。例如,第55个过滤器(图2,左下)对圆形十分敏感;第66层(图2,右上)对^形状特别敏感;第118个过滤器(图2,右下)对∨形状非常敏感。这些输入图像中的形状会**相应位置的特征图(图2中的箭头)。
It is worth noticing that we generate the feature maps in Figure 2 without fixing the input size. These feature maps generated by deep convolutional layers are analogous to the feature maps in traditional methods[27], [28]. In those methods, SIFT vectors [29] or image patches [28] aredensely extracted and then encoded,e.g., by vector quantization [16], [15],[30],sparse coding [17], [18], or Fisher kernels [19].These encoded features consist of the feature maps,and are then pooled by Bag-of-Words (BoW) [16] or spatialpyramids [14], [15]. Analogously, the deep convolutional features can be pooled in a similar way.
值得注意的是,图2中生成的特征图并没有固定输入尺寸。深度卷积层生成的特征图和传统方法[27][28]中的特征图很相似。这些传统方法中,SIFT向量[29]或图像碎片[28]被密集地抽取出来,在通过矢量量化[16][15][30],稀疏化[17][18]或Fisher核函数[19]进行编码。这些编码后的特征构成了特征图,然后通过词袋(BoW)[16]或空间金字塔[14][15]进行池化。类似的深度卷积的特征也可以这样做。
2.2 The Spatial Pyramid Pooling Layer
The convolutional layers accept arbitrary input sizes,but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatialbins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.
2.2 空间金字塔池化层
To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g.,pool5, after the last convolutional layer) with a spatial pyramid pooling layer.Figure 3 illustrates our method.In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling).The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M(k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
为了让我们的神经网络适应任意尺寸的图像输入,我们用一个空间金字塔池化层替换掉了最优一个池化层(最后一个卷积层之后的pool5)。图3示例了这种方法。在每个空间块中,我们池化每个过滤器的响应(本文中采用了最大池化法)。空间金字塔的输出是一个kM维向量,M代表块的数量,k代表最后一层卷积层的过滤器的数量。这个固定维度的向量就是全连接层的输入。
With spatial pyramid pooling, the input image can 4 beof any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g.,min(w,h)=180,224,...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods,e.g.,the SIFT vectors are often extracted at multiple scales [29], [27] (determinedby the sizes of the patches and Gaussian filters). We will show that the scales are also important for the accuracy of deep networks.
有了空间金字塔池化,输入图像就可以是任意尺寸了。不但允许任意比例关系,而且支持任意缩放尺度。我们也可以将输入图像缩放到任意尺度(例如min(w;h)=180,224,…)并且使用同一个深度网络。当输入图像处于不同的空间尺度时,带有相同大小卷积核的网络就可以在不同的尺度上抽取特征。跨多个尺度在传统方法中十分重要,比如SIFT向量就经常在多个尺度上进行抽取[29][27](受碎片和高斯过滤器的大小所决定)。我们接下来会说明多尺度在深度网络精度方面的重要作用。Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation,which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33],a global average pooling is used on the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method.
有趣的是,粗糙的金字塔级别只有一个块,覆盖了整张图像。这就是一个全局池化操作,当前有很多正在进行的工作正在研究它。[33]中,一个放在全连接层之后的全局平均池化被用来提高测试阶段的精确度;[34]中,一个全局最大池化用于弱监督物体识别。全局池化操作相当于传统的词袋方法。
2.3 Training the Network
Theoretically, the above network structure can be trainedwith standard back-propagation [1], regardless of the input image size. But inpractice the GPU implementations (such as cuda-convnet[3] and Caffe[35]) are preferably run on fixed input images. Next we describe our training solution that takes advantage of these GPU implementations while still preserving the spatial pyramid pooling behaviors.
2.3 网络的训练
3 用于图像分类的SPP-NET
3.1 ImageNet 2012分类实验
3.1.1 基准网络架构
3.1.2 多层次池化提升准确度
3.1.3 多尺寸训练提升准确度
3.1.5 特征图上的多视图测试
3.2 Experiments on VOC 2007 Classification
3.3 Experiments on Caltech101
4 SPP-NET用于物体检测
4.1 检测算法
4.2 检测结果
表10中,我们进一步使用相同预训练的SPPnet模型(ZF-5)和R-CNN进行比较。本例中,我们的方法和R-CNN有相当的平均成绩。R-CNN的结果是通过预训练模型进行提升的。这是因为ZF-5比AlexNet有更好的架构,而且SPPnet是多层次池化(如果使用非SPP的ZF-5,R-CNN的结果就会下降)。表11表明了每个类别的结果。表也包含了其他方法。选择性搜索(SS)[20]在SIFT特征图上应用空间金字塔匹配。DPM[23]和Regionlet[39]都是基于HOG特征的[24]。Regionlet方法通过结合包含conv5的同步特征可以提升到46.1%。DetectorNet[40]训练一个深度网络,可以输出像素级的对象遮罩。这个方法仅仅需要对整张图片应用深度网络一次,和我们的方法一样。但他们的方法mAP比较低(30.5%)。