【论文翻译】Fully Convolutional Networks for Semantic Segmentation
论文题目:Fully Convolutional Networks for Semantic Segmentation
论文来源:Fully Convolutional Networks for Semantic Segmentation_2015_CVPR
翻译人:[email protected]实验室
Fully Convolutional Networks for Semantic Segmentation
用于语义分割的全卷积网络
Abstract
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
摘要
卷积网络是能够产生特征层次结构的强大视觉模型。我们表明，卷积网络本身，经过端到端、像素到像素的训练，在语义分割方面超过了最先进的水平。我们的主要见解是建立“全卷积”网络，该网络可接受任意大小的输入，并通过有效的推理和学习产生相应大小的输出。我们定义并详细说明了全卷积网络的空间，解释了它们在空间密集预测任务中的应用，并阐述了与先前模型的联系。我们将当代分类网络（AlexNet [20]、VGG网络[31]和GoogLeNet [32]）改造成全卷积网络，并通过微调[3]将它们学习到的表示迁移到分割任务中。然后，我们定义了一个跳跃结构，它将来自深度粗糙层的语义信息与来自浅层精细层的外观信息相结合，以生成准确和详细的分割。我们的全卷积网络在PASCAL VOC（在2012年数据上相对提升20%，达到62.2%的平均IU）、NYUDv2和SIFT Flow上实现了最先进的分割，而对于典型的图像，推断时间不到五分之一秒。
1. 解决了输入大小尺寸限制问题
2. 将经典分类网络改造成全卷积网络，开创了语义分割的先河，实现了端到端（end-to-end）训练的像素级分类预测
3. 实现了对PASCAL VOC、NYUDv2和SIFT Flow的最先进的分割
4. 主要技术贡献(卷积化、跳跃连接、反卷积)
1.Introduction
Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [20, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 10, 17], part and keypoint prediction [39, 24], and local correspondence [24, 8].
The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 7, 28, 15, 13, 9], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.
We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-ata-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.
This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 7, 28, 9], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [7, 15], proposals [15, 13], or post-hoc refinement by random fields or local classifiers [7, 15]. Our model transfers recent success in classification [20, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [7, 28, 27].
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid. We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).
In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.
1.引言
卷积网络正在推动识别技术的进步。卷积网络不仅改善了整体图像分类[20,31,32]，而且在具有结构化输出的局部任务上也取得了进展。这些进展包括边界框目标检测[29,10,17]，部分和关键点预测[39,24]，以及局部对应[24,8]。
从粗略推断到精细推断，自然的下一步是对每个像素进行预测。先前的方法已经将卷积网络用于语义分割[27,2,7,28,15,13,9]，其中每个像素都用其所属对象或区域的类别来标记，但这些方法存在一些缺点，而这正是本文工作所要解决的。
图中的21指PASCAL VOC数据集的类别数，共20个前景类别，加上背景共21类。
我们表明了在语义分割上，经过端到端、像素到像素训练的全卷积网络（FCN）超过了最新技术，而无需其他机制。据我们所知，这是第一个端到端训练FCN的工作：（1）用于像素级预测，并且（2）来自有监督的预训练。现有网络的全卷积版本可以对任意大小的输入预测密集输出。学习和推理都是通过密集的前馈计算和反向传播在整幅图像上一次完成的。网络内的上采样层使得带有下采样池化的网络也能进行像素级的预测和学习。
这种方法在渐近意义和绝对意义上都是高效的，并且不需要其他工作中的那些复杂处理。逐块训练是很常见的[27,2,7,28,9]，但缺乏全卷积训练的效率。我们的方法没有利用复杂的预处理和后处理，包括超像素[7,15]、候选区域[15,13]，或通过随机场或局部分类器进行的事后细化[7,15]。我们的模型通过将分类网络重新解释为全卷积并从其学习到的表示进行微调，将最近在分类任务[20,31,32]中取得的成功迁移到密集预测任务。相比之下，以前的工作是在没有经过有监督预训练的情况下应用了小型卷积网络[7,28,27]。
语义分割面临着语义和位置之间的固有矛盾：全局信息解决“是什么”（what），而局部信息解决“在哪里”（where）。深度特征层次结构以非线性的由局部到全局的金字塔形式对位置和语义进行编码。在4.2节中，我们定义了一个跳跃结构来充分利用这一特征谱，它结合了深层、粗糙的语义信息和浅层、精细的外观信息（请参见图3）。
在下一节中,我们将回顾有关深度分类网,FCN和使用卷积网络进行语义分割的最新方法的相关工作。 以下各节介绍了FCN设计和密集的预测权衡,介绍了我们的具有网络内上采样和多层组合的架构,并描述了我们的实验框架。 最后,我们演示了PASCAL VOC 2011-2,NYUDv2和SIFT Flow的最新结果。
本节介绍了卷积网络的发展背景，以及如何过渡到逐像素预测的想法；然后介绍了FCN的主要贡献，包括可接受任意大小的输入、跳跃（skip）结构等，以及最后的实验结果。
2.Related work
Our approach draws on recent successes of deep nets for image classification [20, 31, 32] and transfer learning [3, 38]. Transfer was first demonstrated on various visual recognition tasks [3, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [10, 15, 13]. We now re-architect and finetune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.
Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [26], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.
Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [4] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.
Alternatively, He et al. [17] discard the nonconvolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.
Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al.[7], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid convnet/nearest neighbor model by Ganin and Lempitsky [9]; and image restoration and depth estimation by Eigen et al. [4, 5]. Common elements of these approaches include
• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 7, 28, 9];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [7, 2, 9];
• input shifting and output interlacing for dense output [29, 28, 9];
• multi-scale pyramid processing [7, 28, 9];
• saturating tanh nonlinearities [7, 4, 28]; and
• ensembles [2, 9],
whereas our method does without this machinery. However, we do study patchwise training 3.4 and “shift-and-stitch” dense output 3.2 from the perspective of FCNs. We also discuss in-network upsampling 3.3, of which the fully connected prediction by Eigen et al. [5] is a special case.
Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.
Hariharan et al. [15] and Gupta et al. [13] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [10] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end. They achieve state-of-the-art segmentation results on PASCAL VOC and NYUDv2 respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.
We fuse features across layers to define a nonlinear local-to-global representation that we tune end-to-end. In contemporary work Hariharan et al. [16] also use multiple layers in their hybrid model for semantic segmentation.
2.相关工作
我们的方法借鉴了最近成功的用于图像分类[20, 31, 32]和迁移学习[3,38]的深度网络。迁移首先在各种视觉识别任务[3,38],然后是检测,以及混合提议分类器模型中的实例和语义分割任务[10,15,13]上进行了演示。我们现在重新构建和微调分类网络,来直接,密集地预测语义分割。我们绘制了FCN的空间,并在此框架中放置了历史和近期的先前模型。
全卷积网络 据我们所知，将卷积网络扩展到任意大小输入的想法最早出现在Matan等人[26]的工作中，他们扩展了经典的LeNet [21]来识别数字串。因为他们的网络被限制为一维输入字符串，所以Matan等人使用Viterbi解码来获得输出。Wolf和Platt [37]将卷积网络输出扩展为邮政地址块四个角的检测分数的二维图。这两个历史工作都以全卷积的方式进行推理和学习，以便进行检测。Ning等人[27]定义了一个卷积网络，通过全卷积推理对秀丽隐杆线虫组织进行粗略的多类分割。
在当前的多层网络时代,全卷积计算也已经得到了利用。Sermanet等人的滑动窗口检测 [29],Pinheiro和Collobert [28]的语义分割,以及Eigen等人的图像恢复 [4]都做了全卷积推理。全卷积训练很少见,但Tompson等人有效地使用了它 [35]来学习一个端到端的部分探测器和用于姿势估计的空间模型,尽管他们没有对这个方法进行解释或分析。
另外，He等人[17]丢弃分类网络的非卷积部分来构造特征提取器。他们结合候选区域和空间金字塔池化，产生局部化的、固定长度的特征用于分类。虽然快速有效，但这种混合模型无法进行端到端的学习。
用卷积网络进行密集预测 最近的一些研究已经将卷积网络应用于密集预测问题，包括Ning等人[27]、Farabet等人[7]以及Pinheiro和Collobert [28]的语义分割；Ciresan等人[2]针对电子显微镜图像的边界预测，以及Ganin和Lempitsky [9]用混合卷积网络/最近邻模型进行的自然图像边界预测；Eigen等人[4,5]的图像恢复和深度估计。这些方法的共同要素包括：
• 限制容量和感受野的小模型;
• 逐块训练[27, 2, 7, 28, 9];
• 有超像素投影,随机场正则化,滤波或局部分类[7,2,9]的后处理过程;
• 密集输出的输入移位和输出交错[29,28,9];
• 多尺度金字塔处理[7,28,9];
• 饱和tanh非线性[7,4,28];
• 集成[2,9]
而我们的方法没有这些机制。然而，我们从FCN的角度研究了逐块训练（3.4节）和“移位-拼接”（shift-and-stitch）密集输出（3.2节）。我们还讨论了网络内上采样（3.3节），其中Eigen等人的全连接预测[5]是一个特例。
与这些现有方法不同,我们采用并扩展了深度分类架构,使用图像分类作为有监督的预训练,并通过全卷积微调,以简单有效的从整个图像输入和整个图像的Ground Truths中学习。
在机器学习中，ground truth表示有监督学习训练集的标注真值，用于证明或者推翻某个假设。有监督的机器学习会对训练数据打标记，试想一下，如果训练标记错误，那么将会对测试数据的预测产生影响，因此这里将那些正确打标记的数据称为ground truth。
Hariharan等人[15]和Gupta等人[13]同样使深度分类网络适应语义分割，但只在混合候选区域-分类器模型中这样做。这些方法通过对边界框和/或候选区域采样来微调R-CNN系统[10]，以进行检测、语义分割和实例分割。这两种方法都不是端到端学习的。他们分别在PASCAL VOC和NYUDv2上实现了最先进的分割结果，因此我们在第5节中直接将我们的独立的、端到端的FCN与他们的语义分割结果进行比较。
我们融合各层的特征去定义一个我们端到端调整的非线性局部到全局的表示。在当代工作中,Hariharan等人[16]也在其混合模型中使用了多层来进行语义分割。
介绍了FCN中关键部分的全卷积网络和密集预测的研究现状
3.Fully convolutional networks
Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i, \delta j < k}\right)$$

where k is called the kernel size, s is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.
This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k'+(k-1)s',\, ss'}$$
While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.
A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(\mathbf{x};\theta) = \sum_{ij} \ell'(\mathbf{x}_{ij};\theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.
When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.
We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick, fast scanning [11], introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.
3.全卷积网络
卷积网络中的每一层数据都是大小为h×w×d的三维数组,其中h和w是空间维度,d是特征或通道维度。第一层是有着像素大小为h×w,以及d个颜色通道的图像,较高层中的位置对应于它们路径连接的图像中的位置,这些位置称为其感受野。
卷积网络建立在平移不变性的基础之上。它们的基本组成部分（卷积、池化和激活函数）作用于局部输入区域，并且仅依赖于相对空间坐标。用 $x_{ij}$ 表示特定层位置 $(i, j)$ 处的数据向量，用 $y_{ij}$ 表示下一层对应位置的数据向量，则这些函数通过下式计算输出：

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i, \delta j < k}\right)$$

其中 $k$ 称为内核大小，$s$ 为步长或子采样因子，$f_{ks}$ 决定层的类型：用于卷积或平均池化的矩阵乘法，用于最大池化的空间最大值，用于激活函数的逐元素非线性，其他类型的层亦然。
这种函数形式在复合下保持不变，内核大小和步长遵守如下变换规则：

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k'+(k-1)s',\, ss'}$$
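译者注：下面按上式做一个简单的数值推演（层的尺寸为示意性假设，非论文原文）：若 $g$ 为 $2\times 2$、步长 $s'=2$ 的池化层，$f$ 为 $3\times 3$、步长 $s=1$ 的卷积层，则

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k'+(k-1)s',\, ss'} = (f \circ g)_{2+(3-1)\cdot 2,\; 1\cdot 2} = (f \circ g)_{6,\,2},$$

即二者的复合等价于在原始输入上作用一个 $6\times 6$、步长为2的滤波器。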
一般的深度网络计算的是一般的非线性函数，而只由这种形式的层构成的网络则计算一个非线性滤波器，我们称之为深度滤波器或全卷积网络。FCN自然地可以对任何大小的输入进行操作，并产生相应（可能经过重采样的）空间维度的输出。
与FCN复合的实值损失函数定义了一个任务。如果损失函数是对最终层空间维度的求和，即 $\ell(\mathbf{x};\theta) = \sum_{ij} \ell'(\mathbf{x}_{ij};\theta)$，那么它的梯度将是每个空间分量梯度之和。因此，在整幅图像上计算的 $\ell$ 的随机梯度下降，与把所有最终层感受野视为一个小批量、在 $\ell'$ 上进行的随机梯度下降是相同的。
当这些感受野显着重叠时,前馈计算和反向传播在整个图像上逐层计算而不是单独逐块计算时效率更高。
接下来，我们将解释如何将分类网络转换为产生粗糙输出图的全卷积网络。对于逐像素预测，我们需要将这些粗糙输出连接回像素。第3.2节为此目的引入了一个技巧，即快速扫描[11]。我们通过将其重新解释为等效的网络修改来深入理解这一技巧。作为一种高效且有效的替代方案，我们在第3.3节中介绍了用于上采样的反卷积层。在第3.4节中，我们考虑了逐块采样的训练方式，并在第4.3节中给出证据，表明我们的整幅图像训练更快且同样有效。
3.1. Adapting classifiers for dense prediction
Typical recognition nets, including LeNet [21], AlexNet [20], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce non-spatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2.
Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to infer the classification scores of a 227×227 image, the fully convolutional net takes 22 ms to produce a 10×10 grid of outputs from a 500×500 image, which is more than 5 times faster than the naïve approach.
The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution. The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass.
While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets sub-sample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.
3.1. 调整分类器以进行密集预测
典型的识别网络,包括LeNet [21]、AlexNet [20]及其更深层的后继网络[31、32],表面上接受固定大小的输入并产生非空间输出。这些网络的全连接层具有固定的尺寸并且丢弃了空间坐标。然而,这些完全连接的层也可以被视为具有覆盖其整个输入区域的内核的卷积。这样做将它们转换成完全卷积的网络,该网络可以接受任何大小的输入并输出分类图。这个转换如图2所示。(相比之下,非卷积网,例如Le等人[20]的网络,缺乏这种能力。)
此外，尽管生成的输出图等效于在特定输入块上对原始网络的评估，但计算在这些块的重叠区域上被高度摊销。例如，虽然AlexNet花费1.2毫秒（在典型的GPU上）来推断一张227×227图像的分类分数，但全卷积网络只需22毫秒即可从一张500×500图像生成10×10的输出网格，比朴素的逐块方法快5倍以上。
这些卷积化模型的空间输出图使它们成为语义分割等密集问题的自然选择。由于每个输出单元都有可用的ground truth，前向和后向传播都很直接，并且都利用了卷积固有的计算效率（以及积极的优化）。AlexNet示例相应的后向传播时间，对于单幅图像是2.4毫秒，对于全卷积的10×10输出图是37毫秒，获得了与前向传播类似的加速。
尽管我们将分类网络重新解释为全卷积后，可以得到任意大小输入的输出图，但输出维度通常会因下采样而减小。分类网络通过下采样来保持滤波器较小并使计算要求合理。这使这些网络的全卷积版本的输出变得粗糙：其相对于输入的尺寸缩小了一个因子，该因子等于输出单元感受野的像素步幅。
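译者注：下面是一个把全连接层重新解释为覆盖其整个输入区域的卷积的最小示例。这里使用PyTorch演示（属于译者的假设，论文本身基于Caffe实现），网络结构为玩具示例：

```python
import torch
import torch.nn as nn

# 玩具“分类器头部”：一个期望 512x7x7 特征图的全连接层
fc = nn.Linear(512 * 7 * 7, 21)

# 等价的卷积：卷积核恰好覆盖整个 7x7 输入区域
conv = nn.Conv2d(512, 21, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(21, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 512, 7, 7)
# 在 7x7 输入上，二者给出相同的分数
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))

# 但卷积版本还能接受更大的输入，并输出一个空间分数网格
y = conv(torch.randn(1, 512, 16, 16))
print(y.shape)  # torch.Size([1, 21, 10, 10])
```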
3.2. Shift-and-stitch is filter rarefaction
Dense predictions can be obtained from coarse outputs by stitching together output from shifted versions of the input. If the output is downsampled by a factor of $f$, shift the input $x$ pixels to the right and $y$ pixels down, once for every value of $(x, y)$ s.t. $0 \le x, y < f$. Process each of these $f^2$ inputs, and interlace the outputs so that the predictions correspond to the pixels at the centers of their receptive fields.
Although performing this transformation naïvely increases the cost by a factor of $f^2$, there is a well-known trick for efficiently producing identical results [11, 29] known to the wavelet community as the à trous algorithm [25]. Consider a layer (convolution or pooling) with input stride $s$, and a subsequent convolution layer with filter weights $f_{ij}$ (eliding the irrelevant feature dimensions). Setting the lower layer’s input stride to 1 upsamples its output by a factor of $s$. However, convolving the original filter with the upsampled output does not produce the same result as shift-and-stitch, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

$$f'_{ij} = \begin{cases} f_{i/s,\, j/s} & \text{if } s \text{ divides both } i \text{ and } j; \\ 0 & \text{otherwise,} \end{cases}$$

(with $i$ and $j$ zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed. (In practice, this can be done efficiently by processing subsampled versions of the upsampled input.)
Decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. The shift-and-stitch trick is another kind of tradeoff: the output is denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.
Although we have done preliminary experiments with this trick, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.
3.2 移位和拼接是过滤器稀疏
通过将输入的移位版本的输出拼接在一起，可以从粗糙的输出中获得密集预测。如果输出被下采样了 $f$ 倍，则将输入向右移 $x$ 个像素、向下移 $y$ 个像素（左上填充），对每个满足 $0 \le x, y < f$ 的 $(x, y)$ 各做一次。处理这 $f^2$ 个输入，并将输出交错排列，使预测与其感受野中心处的像素相对应。
尽管朴素地执行这种变换会使成本增加 $f^2$ 倍，但有一个众所周知的技巧可以有效地产生相同的结果[11,29]，小波领域称之为à trous算法[25]。考虑一个输入步长为 $s$ 的层（卷积或池化），以及随后一个滤波器权重为 $f_{ij}$ 的卷积层（省略无关的特征维度）。将较低层的输入步长设置为1，会将其输出上采样 $s$ 倍。但是，将原始滤波器与上采样后的输出进行卷积，并不会产生与移位-拼接相同的结果，因为原始滤波器只看到其（现已上采样的）输入的一个缩减部分。要重现该技巧，需要将滤波器放大（稀疏化）为

$$f'_{ij} = \begin{cases} f_{i/s,\, j/s} & \text{若 } s \text{ 同时整除 } i \text{ 和 } j; \\ 0 & \text{其他情况,} \end{cases}$$

（其中 $i$ 和 $j$ 从零开始）。要重现该技巧的完整网络输出，需要逐层重复这种滤波器放大，直到所有子采样都被去除为止。（实际上，可以通过处理上采样输入的子采样版本来高效地完成此操作。）
减少网络内的二次采样是一个权衡:过滤器看到更精细的信息,但是感受野更小,计算时间更长。移位和拼接技巧是另一种权衡:在不减小滤波器感受野大小的情况下,输出更密集,但是滤波器被禁止以比其原始设计更精细的尺度访问信息。
虽然我们已经用这个技巧做了初步的实验,但是我们没有在我们的模型中使用它。我们发现通过上采样进行学习(如下一节所述)更加有效和高效,尤其是与后面描述的跳跃层融合相结合时。
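译者注：下面用NumPy给出上式“滤波器稀疏化”（在滤波器元素之间插入零）的一个简单示意，函数名为译者自拟：

```python
import numpy as np

def rarefy(f, s):
    """按 f'[i, j] = f[i/s, j/s]（当 s 同时整除 i 和 j 时，否则为 0）放大滤波器。"""
    k = f.shape[0]
    out = np.zeros((s * (k - 1) + 1, s * (k - 1) + 1), dtype=f.dtype)
    out[::s, ::s] = f
    return out

f = np.arange(9, dtype=float).reshape(3, 3)
print(rarefy(f, 2))   # 3x3 滤波器被放大为 5x5，原元素之间插入了零
```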
3.3.Upsampling is backwards strided convolution
Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
In a sense, upsampling with factor $f$ is convolution with a fractional input stride of $1/f$. So long as $f$ is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of $f$. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.
3.3. 上采样是反向跨步的卷积
将粗糙输出连接到密集像素的另一种方法是插值。例如,简单的双线性插值通过一个只依赖于输入和输出单元的相对位置的线性映射,从最近的四个输入计算每个输出。
从某种意义上讲，以因子 $f$ 进行上采样，相当于输入步长为 $1/f$ 的（分数步长）卷积。只要 $f$ 是整数，一种自然的上采样方法就是输出步长为 $f$ 的反向卷积（有时称为反卷积）。这种操作实现起来很简单，因为它只是将卷积的前向和反向传播过程颠倒过来。因此，上采样可以在网络内进行，通过像素级损失的反向传播实现端到端学习。
注意，这种层中的反卷积滤波器不需要固定（例如固定为双线性上采样），而是可以学习的。反卷积层和激活函数的叠加甚至可以学习非线性上采样。
在我们的实验中,我们发现网络内上采样对于学习密集预测是快速有效的。在第4.2节中,我们的最佳分割架构使用这些层来学习如何对精确预测进行上采样。
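译者注：下面是把转置卷积（反卷积）权重初始化为双线性插值核的一个草图（使用PyTorch，属于译者的假设；双线性核的具体构造是常见做法，论文正文未给出公式）：

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, k):
    """生成形状为 (channels, channels, k, k) 的双线性上采样权重，
    每个通道各用一个双线性滤波器，通道之间不混合。"""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = np.ogrid[:k, :k]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    w = np.zeros((channels, channels, k, k), dtype=np.float32)
    w[range(channels), range(channels)] = filt
    return torch.from_numpy(w)

# 针对21个得分通道的2倍上采样层，初始化为双线性插值
up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, bias=False)
up.weight.data.copy_(bilinear_kernel(21, 4))
x = torch.randn(1, 21, 10, 10)
print(up(x).shape)  # torch.Size([1, 21, 22, 22])，裁剪之前
```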
3.4.Patchwise training is loss sampling
In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.
If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.2
Sampling in patchwise training can correct class imbalance [27, 7, 2] and mitigate the spatial correlation of dense patches [28, 15]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.
We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.
3.4. 逐块训练是采样损失
在随机优化中，梯度计算由训练分布驱动。逐块训练和全卷积训练都可以产生任何分布，尽管它们的相对计算效率取决于重叠程度和小批量大小。整幅图像的全卷积训练与这样的逐块训练是相同的：每一批由一幅图像（或一组图像）损失下方所有单元的全部感受野组成。虽然这比对patch进行均匀采样更高效，但它减少了可能的批次数量。然而，在一幅图像内随机选择patch的做法可以很容易地恢复出来：将损失限制为其空间项的一个随机采样子集（或者等效地，在输出和损失之间应用DropConnect掩码[36]），就能把一些patch排除在梯度计算之外。
patch指一张二维图像中的一个小块，一张二维图像中有很多个patch。正如在神经网络的卷积计算中，图像并不是一整块直接同卷积核进行运算，而是被分成很多个patch分别同卷积核进行卷积运算，这些patch的大小取决于卷积核的size。卷积核每次只查看一个patch，然后移动到另一个patch，直到图像分成的所有patch都参与完运算。因为找不到合适的术语，暂不翻译这个词。
如果保留的patches仍然有明显的重叠,全卷积计算仍将加速训练。如果梯度是通过多次向后传播累积的,批次可以包括来自多个图像的patches
在逐块训练中采样可以纠正类不平衡[27,7,2]并减轻密集patches的空间相关性[28,15]。在全卷积训练中,也可以通过加权损失来实现类平衡,并且可以使用损失采样来解决空间相关性。
我们在第4.3节中探讨了带采样的训练，但没有发现它对密集预测产生更快或更好的收敛。整幅图像训练是有效且高效的。
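译者注：下面用PyTorch给出“把损失限制为空间项的随机子集”（即在输出与损失之间施加DropConnect式掩码）的一个最小示意，函数与参数名均为译者自拟：

```python
import torch
import torch.nn.functional as F

def sampled_pixel_loss(scores, labels, p=0.5):
    """逐像素交叉熵损失，每个最终层单元以概率 p 保留（其余被掩掉）。
    scores: (N, C, H, W) 未归一化的类别得分；labels: (N, H, W)。"""
    per_pixel = F.cross_entropy(scores, labels, reduction="none")  # (N, H, W)
    keep = (torch.rand_like(per_pixel) < p).float()
    return (per_pixel * keep).sum()

scores = torch.randn(2, 21, 32, 32, requires_grad=True)
labels = torch.randint(0, 21, (2, 32, 32))
loss = sampled_pixel_loss(scores, labels, p=0.25)
loss.backward()
print(loss.item())
```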
4.Segmentation Architecture
We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we add skips between layers to fuse coarse, semantic and local, appearance information. This skip architecture is learned end-to-end to refine the semantics and spatial precision of the output.
For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [6]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel inter-section over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.
4.分割架构
我们将ILSVRC分类器转换成FCN网络,并通过网络内上采样和像素级损失来增强它们以进行密集预测。我们通过微调进行分割训练。接下来,我们在层与层之间添加跳跃结构来融合粗糙的、语义的和局部的外观信息。这种跳跃结构是端到端学习的,以优化输出的语义和空间精度。
在本研究中，我们在PASCAL VOC 2011分割挑战[6]上进行训练和验证。我们使用逐像素的多项逻辑斯蒂损失进行训练，并用标准的平均像素交并比（mean IU）度量进行验证，其中对包括背景在内的所有类别取平均。训练忽略了在ground truth中被掩盖（标记为模糊或困难）的像素。
4.1. From classifier to dense FCN
We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet architecture that won ILSVRC12, as well as the VGG nets and the GoogLeNet which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net, which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a (backward) convolution layer to bilinearly upsample the coarse outputs to pixelwise outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).
Our training for this comparison follows the practices for classification networks. We train by SGD with momentum. Gradients are accumulated over 20 images. We set fixed learning rates of $10^{-3}$, $10^{-4}$, and $5^{-5}$ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of $5^{-4}$ or $2^{-4}$, and doubled learning rate for biases. We zero-initialize the class scoring layer, as random initialization yielded neither better performance nor faster convergence. Dropout is included where used in the original classifier nets (however, training without it made little to no difference).
Fine-tuning from classification to segmentation gives reasonable predictions from each net. Even the worst model achieved ~75 percent of the previous best performance. FCN-VGG16 already appears to be better than previous methods at 56.0 mean IU on val, compared to 52.6 on test. Although VGG and GoogLeNet are similarly accurate as classifiers, our FCN-GoogLeNet did not match FCN-VGG16. We select FCN-VGG16 as our base network.
4.1 从分类器到密集FCN
首先，如第3节所述，我们将经过验证的分类架构卷积化。我们考虑赢得ILSVRC12的AlexNet架构，以及在ILSVRC14中表现出色的VGG网络和GoogLeNet。我们选择了VGG 16层网络，我们发现它在此任务上与19层网络相当。对于GoogLeNet，我们仅使用最终的损失层，并通过舍弃最终的平均池化层来提高性能。我们舍弃每个网络最终的分类器层，并将所有全连接层转换为卷积层。我们附加一个通道维度为21的1×1卷积，以在每个粗糙输出位置上预测每个PASCAL类别（包括背景）的分数，随后是一个（反向）卷积层，将粗糙输出双线性上采样为像素级输出，如第3.3节所述。表1比较了初步验证结果以及每个网络的基本特征。我们报告了以固定学习率收敛（至少175个epoch）后获得的最佳结果。
我们进行此比较的训练遵循分类网络的惯例。我们使用带动量的SGD进行训练，梯度在20张图像上累积。我们将FCN-AlexNet、FCN-VGG16和FCN-GoogLeNet的固定学习率分别设置为 $10^{-3}$、$10^{-4}$ 和 $5^{-5}$（通过线搜索选择）。我们使用动量0.9，权重衰减 $5^{-4}$ 或 $2^{-4}$，并将偏置的学习率加倍。我们将类别打分层初始化为零，因为随机初始化既不会产生更好的性能，也不会带来更快的收敛。原始分类器网络中使用了Dropout的地方我们也保留了Dropout（不过，去掉它训练几乎没有差别）。
从分类到分割的微调为每个网络都给出了合理的预测。即使是最差的模型也达到了以前最佳性能的约75%。FCN-VGG16在val上的平均IU为56.0，已经优于以前的方法在test上的52.6。尽管VGG和GoogLeNet作为分类器的精度相近，但我们的FCN-GoogLeNet没有达到FCN-VGG16的水平。我们选择FCN-VGG16作为我们的基础网络。
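译者注：下面是一个FCN-32s式“打分+上采样”头部的最小草图（使用PyTorch，属于译者的假设），演示零初始化的21通道1×1打分卷积和随后的反卷积上采样；骨干网络和裁剪偏移等细节做了简化：

```python
import torch
import torch.nn as nn

class FCN32sHead(nn.Module):
    """假设骨干网络输出 (N, 512, H/32, W/32) 的特征图（例如卷积化后的VGG16）。"""
    def __init__(self, in_channels=512, num_classes=21):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        nn.init.zeros_(self.score.weight)   # 类别打分层零初始化
        nn.init.zeros_(self.score.bias)
        # 反向（转置）卷积，实现32倍上采样
        self.up = nn.ConvTranspose2d(num_classes, num_classes,
                                     kernel_size=64, stride=32, bias=False)

    def forward(self, feats, out_size):
        x = self.up(self.score(feats))
        # 裁剪到输入分辨率（这里把偏移量简化为左上角对齐）
        return x[:, :, :out_size[0], :out_size[1]]

head = FCN32sHead()
feats = torch.randn(1, 512, 16, 16)      # 一张 512x512 图像的特征
print(head(feats, (512, 512)).shape)     # torch.Size([1, 21, 512, 512])
```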
4.2 Image-to-Image Learning
The image-to-image learning setting includes high effective batch size and correlated inputs. This optimization requires some attention to properly tune FCNs.
We begin with the loss. We do not normalize the loss, so that every pixel has the same weight regardless of the batch and image dimensions. Thus we use a small learning rate since the loss is summed spatially over all pixels.
We consider two regimes for batch size. In the first, gradients are accumulated over 20 images. Accumulation reduces the memory required and respects the different dimensions of each input by reshaping the network. We picked this batch size empirically to result in reasonable convergence. Learning in this way is similar to standard classification training: each minibatch contains several images and has a varied distribution of class labels. The nets compared in Table 1 are optimized in this fashion.
However, batching is not the only way to do imagewise learning. In the second regime, batch size one is used for online learning. Properly tuned, online learning achieves higher accuracy and faster convergence in both number of iterations and wall clock time. Additionally, we try a higher momentum of 0.99, which increases the weight on recent gradients in a similar way to batching. See Table 2 for the comparison of accumulation, online, and high momentum or “heavy” learning (discussed further in Section 6.2).
4.2图像到图像学习
图像到图像的学习设定意味着很大的有效批量大小以及相互关联的输入。这种优化需要一些注意才能正确地调整FCN。
我们从损失开始。我们不对损失进行归一化，因此无论批次和图像尺寸如何，每个像素都具有相同的权重。由于损失是在所有像素上进行空间求和的，所以我们使用较小的学习率。
对于批量大小，我们考虑两种方案。在第一种方案中，梯度在20张图像上累积。累积减少了所需的内存，并通过重塑网络来适应每个输入的不同尺寸。我们凭经验选择了这个批量大小，以获得合理的收敛。以这种方式进行的学习类似于标准的分类训练：每个小批量包含若干图像，并具有多样的类别标签分布。表1中比较的网络即以这种方式优化。
但是，批处理并不是进行逐图像学习的唯一方法。第二种方案使用批量大小为1的在线学习。经过适当调整，在线学习在迭代次数和挂钟时间两方面都能实现更高的准确性和更快的收敛。此外，我们尝试使用0.99的更高动量，它以类似于批处理的方式增加最近梯度的权重。有关梯度累积、在线学习和高动量（“重”）学习的比较，请参见表2（在6.2节中进一步讨论）。
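译者注：下面用PyTorch示意上述两种方案——在20张图像上累积梯度后再更新，以及批量大小为1、高动量的“重”在线学习。模型与超参数均为示意性假设：

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)           # 用一个极简模型代替完整的FCN
criterion = nn.CrossEntropyLoss(reduction="sum")  # 不归一化：每个像素权重相同

# 方案一：在20张图像上累积梯度，然后进行一次更新
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
opt.zero_grad()
for _ in range(20):
    img = torch.randn(1, 3, 64, 64)
    target = torch.randint(0, 21, (1, 64, 64))
    criterion(model(img), target).backward()      # 梯度不断累积
opt.step()

# 方案二：“重”在线学习，批量大小为1，动量更高
opt_online = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.99)
for _ in range(20):
    opt_online.zero_grad()
    img = torch.randn(1, 3, 64, 64)
    target = torch.randint(0, 21, (1, 64, 64))
    criterion(model(img), target).backward()
    opt_online.step()
```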
4.3 Combining What and Where
We define a new fully convolutional net for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Fig. 3.
While fully convolutionalized classifiers fine-tuned to semantic segmentation both recognize and localize, as shown in Section 4.1, these networks can be improved to make direct use of shallower, more local features. Even though these base networks score highly on the standard metrics, their output is dissatisfyingly coarse (see Fig. 4). The stride of the network prediction limits the scale of detail in the upsampled output.
We address this by adding skips [51] that fuse layer outputs, in particular to include shallower layers with finer strides in prediction. This turns a line topology into a DAG: edges skip ahead from shallower to deeper layers. It is natural to make more local predictions from shallower layers since their receptive fields are smaller and see fewer pixels. Once augmented with skips, the network makes and fuses predictions from several streams that are learned jointly and end-to-end.
Combining fine layers and coarse layers lets the model make local predictions that respect global structure. This crossing of layers and resolutions is a learned, nonlinear counterpart to the multi-scale representation of the Laplacian pyramid [26]. By analogy to the jet of Koenderink and van Doorn [27], we call our feature hierarchy the deep jet.
Layer fusion is essentially an elementwise operation. However, the correspondence of elements across layers is complicated by resampling and padding. Thus, in general, layers to be fused must be aligned by scaling and cropping. We bring two layers into scale agreement by upsampling the lower-resolution layer, doing so in-network as explained in Section 3.3. Cropping removes any portion of the upsampled layer which extends beyond the other layer due to padding. This results in layers of equal dimensions in exact alignment. The offset of the cropped region depends on the resampling and padding parameters of all intermediate layers. Determining the crop that results in exact correspondence can be intricate, but it follows automatically from the network definition (and we include code for it in Caffe).
Having spatially aligned the layers, we next pick a fusion operation. We fuse features by concatenation, and immediately follow with classification by a “score layer” consisting of a 1 × 1 convolution. Rather than storing concatenated features in memory, we commute the concatenation and subsequent classification (as both are linear). Thus, our skips are implemented by first scoring each layer to be fused by 1 × 1 convolution, carrying out any necessary interpolation and alignment, and then summing the scores. We also considered max fusion, but found learning to be difficult due to gradient switching. The score layer parameters are zero-initialized when a skip is added, so that they do not interfere with existing predictions of other streams. Once all layers have been fused, the final prediction is then upsampled back to image resolution.
Skip architectures for segmentation. We define a skip architecture to extend FCN-VGG16 to a three-stream net with eight pixel stride shown in Fig. 3. Adding a skip from pool4 halves the stride by scoring from this stride sixteen layer. The interpolation layer of the skip is initialized to bilinear interpolation, but is not fixed so that it can be learned as described in Section 3.3. We call this two-stream net FCN-16s, and likewise define FCN-8s by adding a further skip from pool3 to make stride eight predictions. (Note that predicting at stride eight does not significantly limit the maximum achievable mean IU; see Section 6.3.)
We experiment with both staged training and all-at-once training. In the staged version, we learn the single-stream FCN-32s, then upgrade to the two-stream FCN-16s and continue learning, and finally upgrade to the three-stream FCN-8s and finish learning. At each stage the net is learned end-to-end, initialized with the parameters of the earlier net. The learning rate is dropped 100× from FCN-32s to FCN-16s and 100× more from FCN-16s to FCN-8s, which we found to be necessary for continued improvements.
Learning all-at-once rather than in stages gives nearly equivalent results, while training is faster and less tedious. However, disparate feature scales make naïve training prone to divergence. To remedy this we scale each stream by a fixed constant, for a similar in-network effect to the staged learning rate adjustments. These constants are picked to approximately equalize average feature norms across streams. (Other normalization schemes should have similar effect.)
With FCN-16s validation score improves to 65.0 mean IU,and FCN-8s brings a minor improvement to 65.5. At this point our fusion improvements have met diminishing returns, so we do not continue fusing even shallower layers.
To identify the contribution of the skips we compare scoring from the intermediate layers in isolation, which results in poor performance, or dropping the learning rate without adding skips, which gives negligible improvement in score without refining the visual quality of output. All skip comparisons are reported in Table 3. Fig. 4 shows the progressively finer structure of the output.
4.3组合:what和where
我们定义了一个用于分割的新的全卷积网络,该网络结合了要素层次结构的各个层并改善了输出的空间精度。参见图3。
如第4.1节所示，微调到语义分割任务的全卷积化分类器既能识别又能定位，但这些网络仍可以改进，以直接利用更浅、更局部的特征。尽管这些基础网络在标准指标上得分很高，但它们的输出粗糙得令人不满意（见图4）。网络预测的步幅限制了上采样输出中细节的尺度。
我们通过添加跳跃连接[51]来解决这一问题，它融合各层的输出，特别是把步幅更小的较浅层纳入预测。这把线状拓扑变成了DAG：边从较浅的层向前跳到较深的层。从较浅的层做出更局部的预测是很自然的，因为它们的感受野更小、看到的像素更少。一旦加入跳跃连接，网络就能从多个联合、端到端学习的流中进行预测并加以融合。
将精细层和粗糙层结合起来，可以使模型做出尊重全局结构的局部预测。这种跨层、跨分辨率的组合，是拉普拉斯金字塔[26]多尺度表示的一种可学习的非线性对应。类比于Koenderink和van Doorn [27]的jet，我们将这种特征层次称为deep jet。
层融合本质上是逐元素操作。但是，由于重采样和填充，各层元素之间的对应关系变得复杂。因此，一般来说，要融合的层必须通过缩放和裁剪来对齐。我们通过对较低分辨率的层进行上采样来使两层在尺度上一致，这一步在网络内完成，如第3.3节所述。裁剪会去掉上采样层中由于填充而超出另一层的部分，从而得到尺寸相同、精确对齐的层。裁剪区域的偏移量取决于所有中间层的重采样和填充参数。确定能产生精确对应关系的裁剪可能很繁琐，但它可以由网络定义自动推导得到（我们在Caffe中包含了相应代码）。
在空间上对齐各层之后，我们接下来选择融合操作。我们通过级联（concatenation）融合特征，并紧接着用一个由1×1卷积构成的“打分层”进行分类。我们不把级联后的特征存储在内存中，而是交换级联与随后分类的顺序（因为二者都是线性的）。因此，我们的跳跃连接是这样实现的：先用1×1卷积对要融合的每一层打分，执行必要的插值和对齐，然后将各得分相加。我们也考虑过最大值融合，但发现由于梯度切换，学习会变得困难。添加跳跃连接时，打分层的参数被零初始化，以免干扰其他流已有的预测。一旦所有层都融合完毕，最终预测再被上采样回图像分辨率。
分割的跳跃结构。我们定义了一个跳跃结构，将FCN-VGG16扩展为如图3所示的、像素步幅为8的三流网络。添加来自pool4的跳跃连接，通过从这个步幅为16的层打分，把步幅减半。跳跃连接的插值层被初始化为双线性插值，但并不固定，因此可以按第3.3节所述进行学习。我们将这个双流网络称为FCN-16s，并类似地通过再添加一个来自pool3的跳跃连接得到FCN-8s，以进行步幅为8的预测。（请注意，以步幅8进行预测并不会显著限制可达到的最大平均IU；请参见第6.3节。）
我们同时尝试了分阶段训练和一次性（all-at-once）训练。在分阶段版本中，我们先学习单流的FCN-32s，然后升级为双流的FCN-16s并继续学习，最后升级为三流的FCN-8s并完成学习。每个阶段的网络都是端到端学习的，并用前一阶段网络的参数初始化。从FCN-32s到FCN-16s学习率下降100倍，从FCN-16s到FCN-8s再下降100倍，我们发现这对于持续改进是必要的。
一次性学习而不是分阶段学习可以得到几乎相同的结果，而训练更快、也更省事。但是，各特征尺度的差异使得朴素的训练容易发散。为了解决这个问题，我们用固定常数缩放每个流，以在网络内取得与分阶段学习率调整类似的效果。这些常数的选取使各流的平均特征范数近似相等。（其他归一化方案应具有类似的效果。）
使用FCN-16s时,验证分数平均IU提高到65.0,而FCN-8s则带来小幅提高到65.5。在这一点上,我们的融合改进遇到了收益递减的问题,因此我们不会继续融合更浅的层。
为了确定跳跃连接的贡献，我们比较了两种做法：单独从中间层打分，结果性能较差；以及在不添加跳跃连接的情况下降低学习率，结果分数的提升可以忽略不计，输出的视觉质量也没有改善。表3报告了所有跳跃结构的对比。图4展示了输出结构的逐步精细化。
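译者注：下面是跳跃融合（以FCN-16s中来自pool4的跳跃为例）的一个简化草图（使用PyTorch，属于译者的假设）：对要融合的层先用零初始化的1×1卷积打分，再把粗糙得分上采样2倍并裁剪对齐，然后逐元素求和：

```python
import torch
import torch.nn as nn

num_classes = 21
# 对pool4层用零初始化的1×1卷积打分
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
nn.init.zeros_(score_pool4.weight)
nn.init.zeros_(score_pool4.bias)
# 2倍上采样（反卷积），把步幅32的粗糙得分上采样到步幅16
up2 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, bias=False)

coarse = torch.randn(1, num_classes, 10, 10)   # 假设的步幅32粗糙得分
pool4_feat = torch.randn(1, 512, 21, 21)       # 假设的步幅16 pool4特征
skip = score_pool4(pool4_feat)
upsampled = up2(coarse)                        # (1, 21, 22, 22)
# 裁剪对齐（这里把偏移量简化为左上角对齐），然后逐元素求和
fused = upsampled[:, :, :skip.size(2), :skip.size(3)] + skip
print(fused.shape)                             # torch.Size([1, 21, 21, 21])
# 随后 fused 再经过16倍上采样得到像素级预测（此处省略）
```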
4.4 Experimental Framework
Fine-tuning. We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 73 percent of the full fine-tuning performance as compared in Table 3. Fine-tuning in stages takes 36 hours on a single GPU. Learning FCN-8s all-at-once takes half the time to reach comparable accuracy. Training from scratch gives substantially lower accuracy.
More training data. The PASCAL VOC 2011 segmentation training set labels 1,112 images. Hariharan et al. [52] collected labels for a larger set of 8,498 PASCAL training images, which was used to train the previous best system, SDS [14]. This training data improves the FCN-32s validation score from 57.7 to 63.6 mean IU and improves the FCN-AlexNet score from 39.8 to 48.0 mean IU.
Loss. The per-pixel, unnormalized softmax loss is a natural choice for segmenting images of any size into disjoint classes, so we train our nets with it. The softmax operation induces competition between classes and promotes the most confident prediction, but it is not clear that this is necessary or helpful. For comparison, we train with the sigmoid cross-entropy loss and find that it gives similar results, even though it normalizes each class prediction independently.
Patch sampling. As explained in Section 3.4, our whole image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [10], [11], [12], [13], [16], potentially resulting in higher variance batches that may accelerate convergence [53]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability 1-p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Fig. 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.
Class balancing. Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.
Dense prediction. The scores are upsampled to the input dimensions by backward convolution layers within the net.Final layer backward convolution weights are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear interpolation, and then learned. This simple, end-to-end method is accurate and fast.
Augmentation. We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.
Implementation. All models are trained and tested with Caffe [54] on a single NVIDIA Titan X. Our models and code are publicly available at http://fcn.berkeleyvision.org.
4.4实验框架
微调。我们通过在整个网络上进行反向传播来微调所有层。如表3中的对比所示，仅微调输出分类器只能获得完整微调性能的73%。分阶段微调在单个GPU上需要36小时。一次性学习FCN-8s只需一半时间即可达到相当的精度。从头开始训练则会使精度大幅下降。
更多训练数据。PASCAL VOC 2011分割训练集标注了1112张图像。Hariharan等人[52]为一组更大的8498张PASCAL训练图像收集了标签，这些数据曾用于训练之前的最佳系统SDS [14]。该训练数据将FCN-32s的验证分数从57.7提高到63.6平均IU，并将FCN-AlexNet的分数从39.8提高到48.0平均IU。
损失。对于把任意大小的图像分割为互不相交的类别，逐像素、未归一化的softmax损失是一种自然的选择，因此我们用它来训练网络。softmax运算会引起类别之间的竞争并促进最自信的预测，但尚不清楚这是否必要或有帮助。为了比较，我们用sigmoid交叉熵损失进行训练，发现即使它独立地归一化每个类别的预测，也能得到相似的结果。
Patch采样。如第3.4节所述，我们的整幅图像训练实际上把每张图像分批成一个由大块重叠patch构成的规则网格。相比之下，先前的工作在整个数据集上随机采样patch，可能得到方差更高的批次，从而可能加速收敛。我们按前面描述的方式对损失进行空间采样来研究这种权衡：独立地以概率1-p忽略每个最终层单元。为了避免改变有效批量大小，我们同时将每批图像数量增加1/p倍。请注意，由于卷积的高效性，当p足够大时（例如，根据第3.1节中的数字，至少在p > 0.2时），这种形式的拒绝采样仍然比逐块训练更快。图5显示了这种采样形式对收敛的影响。我们发现，与整幅图像训练相比，采样对收敛速度没有显著影响，但由于每批需要考虑的图像数量更多，耗时明显更长。因此，我们在其他实验中选择不采样的整幅图像训练。
类别平衡。全卷积训练可以通过对损失加权或采样来平衡类别。尽管我们的标签略有不平衡（大约3/4是背景），但我们发现类别平衡并不必要。
密集预测。 分数通过网络中的后向卷积层上采样到输入维度。最终层的后向卷积权重固定为双线性插值,而中间的上采样层则初始化为双线性插值,然后学习。这种简单的端到端方法准确而快速。
增强。我们尝试通过随机镜像和“抖动”来增强训练数据，即把图像在每个方向上平移最多32个像素（最粗的预测尺度）。这没有带来明显的改善。
实现。所有模型都在单个NVIDIA Titan X上使用Caffe [54]训练和测试。我们的模型和代码公开于 http://fcn.berkeleyvision.org。
5 RESULTS
We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, SIFT Flow, and PASCAL-Context. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture on each of these datasets, and then extend it to multimodal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow. All experiments follow the same network architecture and optimization settings decided on in Section 4.
Metrics. We report metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU):
- pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
- mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
- mean IU: $(1/n_{cl}) \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
- frequency weighted IU: $\left(\sum_k t_k\right)^{-1} \sum_i t_i\, n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
where $n_{ij}$ is the number of pixels of class $i$ predicted to belong to class $j$, there are $n_{cl}$ different classes, and $t_i = \sum_j n_{ij}$ is the total number of pixels of class $i$.
PASCAL VOC. Table 4 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous best, SDS [14], and the well-known R-CNN [5]. We achieve the best results on mean IU by 30 percent relative. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall). Fig. 6 compares the outputs of FCN-8s and SDS.
NYUDv2. is an RGB-D dataset collected using the Microsoft Kinect. It has 1,449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. We report results on the standard split of 795 training images and 654 testing images.Table 5 gives the performance of several net variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to similar number of parameters or the difficulty of propagating meaningful gradients all the way through the net. Following the success of Gupta et al, we try the three-dimensional HHA encoding of depth and train a net on just this information. To effectively combine color and depth, we define a “late fusion” of RGB and HHA that averages the final layer scores from both nets and learn the resulting two-stream net end-to-end. This late fusion RGB-HHA net is the most accurate.
SIFT Flow. is a dataset of 2,688 images with pixel labels for 33 semantic classes (“bridge”, “mountain”, “sun”), as well as three geometric classes (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-32/16/8s with semantic and geometric prediction layers and losses. This net performs as well on both tasks as two independently trained nets, while learning and inference are essentially as fast as each independent net by itself. The results in Table 6, computed on the standard split into 2,488 training and 200 test images,show better performance on both tasks.
PASCAL-Context. provides whole scene annotations of PASCAL VOC 2010. While there are 400+ classes, we follow the 59 class task defined by [62] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 7 we compare to the previous best result on this task. FCN-8s scores 39.1 mean IU for a relative improvement of more than 10 percent.
5 结果
我们在语义分割和场景解析上测试我们的FCN，涵盖PASCAL VOC、NYUDv2、SIFT Flow和PASCAL-Context。尽管这些任务历来区分对象和区域，我们将二者统一视为像素预测。我们在每个数据集上评估我们的FCN跳跃结构，然后将其扩展到NYUDv2的多模态输入，以及SIFT Flow的语义和几何标签的多任务预测。所有实验均遵循第4节中确定的相同网络架构和优化设置。
指标。我们报告来自常见语义分割和场景解析评测的指标，它们是像素精度以及区域交并比（IU）的不同变体：
- 像素精度: $\sum_i n_{ii} / \sum_i t_i$
- 平均准确度: $(1/n_{cl}) \sum_i n_{ii} / t_i$
- 平均IU: $(1/n_{cl}) \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
- 频率加权IU: $\left(\sum_k t_k\right)^{-1} \sum_i t_i\, n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
其中 $n_{ij}$ 是类别 $i$ 被预测为类别 $j$ 的像素数，共有 $n_{cl}$ 个不同的类别，$t_i = \sum_j n_{ij}$ 是类别 $i$ 的像素总数。
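译者注：依据上面的定义，下面给出这四个指标基于混淆矩阵的一个NumPy参考实现（译者自拟，为简洁起见未处理某些类别像素数为零时的除零情况）：

```python
import numpy as np

def segmentation_metrics(hist):
    """hist 是 n_cl x n_cl 的混淆矩阵，hist[i, j] 为类别 i 被预测为类别 j 的像素数。"""
    n_ii = np.diag(hist)
    t_i = hist.sum(axis=1)                       # 类别 i 的像素总数
    pred_i = hist.sum(axis=0)                    # 被预测为类别 i 的像素总数
    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.nanmean(n_ii / t_i)
    iu = n_ii / (t_i + pred_i - n_ii)
    mean_iu = np.nanmean(iu)
    freq_weighted_iu = (t_i * iu).sum() / t_i.sum()
    return pixel_acc, mean_acc, mean_iu, freq_weighted_iu

def confusion(gt, pred, n_cl):
    return np.bincount(n_cl * gt.ravel() + pred.ravel(),
                       minlength=n_cl ** 2).reshape(n_cl, n_cl)

gt = np.random.randint(0, 21, (500, 500))
pred = np.random.randint(0, 21, (500, 500))
print(segmentation_metrics(confusion(gt, pred, 21)))
```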
PASCAL VOC. 表4给出了FCN-8s在PASCAL VOC 2011和2012测试集上的性能，并将其与之前的最佳方法SDS以及著名的R-CNN进行了比较。我们在平均IU上取得了最佳结果，相对提升30%。推理时间减少了114倍（仅计卷积网络，忽略候选区域和细化）或286倍（整体）。图6比较了FCN-8s和SDS的输出。
NYUDv2. 是使用Microsoft Kinect收集的RGB-D数据集。它包含1449张RGB-D图像，带有像素级标签，Gupta等人将其合并为40类的语义分割任务。我们报告795张训练图像和654张测试图像的标准划分上的结果。表5给出了若干网络变体的性能。首先，我们在RGB图像上训练未经修改的粗糙模型（FCN-32s）。为了加入深度信息，我们训练了一个升级为接受四通道RGB-D输入（早期融合）的模型。这带来的好处很小，可能是由于参数数量相近，或者难以在整个网络中传播有意义的梯度。沿用Gupta等人的成功做法，我们尝试对深度进行三维HHA编码，并仅基于这一信息训练网络。为了有效地结合颜色和深度，我们定义了RGB和HHA的“后期融合”，把两个网络最终层的得分取平均，并端到端地学习由此得到的双流网络。这个后期融合的RGB-HHA网络是最准确的。
SIFT Flow. 是一个包含2688张图像的数据集，带有33个语义类别（“桥”、“山”、“太阳”等）以及三个几何类别（“水平”、“垂直”和“天空”）的像素标签。FCN可以自然地学习一个能同时预测两种标签的联合表示。我们学习了带有语义和几何两套预测层及损失的双头版FCN-32/16/8s。该网络在两个任务上的表现与两个独立训练的网络相当，而学习和推理基本上和单个独立网络一样快。表6中的结果按标准划分（2488张训练图像和200张测试图像）计算，在两个任务上都显示出更好的性能。
PASCAL-Context. 提供了PASCAL VOC 2010的整幅场景标注。虽然有400多个类别，但我们遵循[62]定义的59类任务，它选取了最常出现的类别。我们分别在训练集和验证集上进行训练和评估。在表7中，我们与此任务上先前的最佳结果进行了比较。FCN-8s的平均IU为39.1，相对提升超过10%。
6 ANALYSIS
We examine the learning and inference of fully convolutional networks. Masking experiments investigate the role of context and shape by reducing the input to only foreground, only background, or shape alone. Defining a “null” background model checks the necessity of learning a background classifier for semantic segmentation. We detail an approximation between momentum and batch size to further tune whole image learning. Finally, we measure bounds on task accuracy for given output resolutions to show there is still much to improve.
6 分析
我们考察全卷积网络的学习和推理。掩膜实验通过把输入限制为仅前景、仅背景或仅形状，来研究上下文和形状的作用。定义一个“空”背景模型，用来检验为语义分割学习背景分类器是否必要。我们详细给出了动量与批量大小之间的一个近似关系，以进一步调整整幅图像学习。最后，我们在给定的输出分辨率下测量任务精度的上界，以表明仍有很大的改进空间。
6.1 Cues
Given the large receptive field size of an FCN, it is natural to wonder about the relative importance of foreground and background pixels in the prediction. Is foreground appearance sufficient for inference, or does the context influence the output? Conversely, can a network learn to recognize a class by its shape and context alone?
Masking. To explore these issues we experiment with masked versions of the standard PASCAL VOC segmentation challenge. We both mask input to networks trained on normal PASCAL, and learn new networks on the masked PASCAL. See Table 8 for masked results.
Masking the foreground at inference time is catastrophic. However, masking the foreground during learning yields a network capable of recognizing object segments without observing a single pixel of the labeled class. Masking the background has little effect overall but does lead to class confusion in certain cases. When the background is masked during both learning and inference, the network unsurprisingly achieves nearly perfect background accuracy; however certain classes are more confused. All-in-all this suggests that FCNs do incorporate context even though decisions are driven by foreground pixels.
To separate the contribution of shape, we learn a net restricted to the simple input of foreground/background masks. The accuracy in this shape-only condition is lower than when only the foreground is masked, suggesting that the net is capable of learning context to boost recognition. Nonetheless, it is surprisingly accurate. See Fig. 7.
Background modeling. It is standard in detection and semantic segmentation to have a background model. This model usually takes the same form as the models for the classes of interest, but is supervised by negative instances. In our experiments we have followed the same approach, learning parameters to score all classes including background. Is this actually necessary, or do class models suffice?
To investigate, we define a net with a “null” background model that gives a constant score of zero. Instead of training with the softmax loss, which induces competition by normalizing across classes, we train with the sigmoid cross-entropy loss, which independently normalizes each score. For inference each pixel is assigned the highest scoring class. In all other respects the experiment is identical to our FCN-32s on PASCAL VOC. The null background net scores 1 point lower than the reference FCN-32s and a control FCN-32s trained on all classes including background with the sigmoid cross-entropy loss. To put this drop in perspective, note that discarding the background model in this way reduces the total number of parameters by less than 0.1 percent. Nonetheless, this result suggests that learning a dedicated background model for semantic segmentation is not vital.
6.1 线索（Cues）
鉴于FCN具有很大的感受野，人们自然会想知道前景像素和背景像素在预测中的相对重要性。前景外观是否足以进行推断，还是上下文也会影响输出？反过来，网络能否仅凭形状和上下文来学习识别一个类别？
掩膜。为了探究这些问题，我们在标准PASCAL VOC分割挑战的掩膜版本上进行实验。我们既对在正常PASCAL上训练的网络的输入加掩膜，也在加掩膜的PASCAL上学习新的网络。掩膜实验的结果见表8。
在推理时屏蔽前景是灾难性的。但是，在学习过程中屏蔽前景，会得到一个无需观察被标注类别的任何一个像素就能识别目标区域的网络。屏蔽背景总体上影响不大，但在某些情况下确实会导致类别混淆。当在学习和推理过程中都屏蔽背景时，网络不出所料地获得了近乎完美的背景精度；但某些类别的混淆更严重。总而言之，这表明即使决策是由前景像素驱动的，FCN也确实利用了上下文。
为了分离形状的贡献，我们学习了一个输入仅限于前景/背景掩膜的网络。在这种仅有形状的条件下，精度低于仅屏蔽前景时的精度，这表明网络能够学习上下文来提升识别能力。尽管如此，它仍然出奇地准确。参见图7。
背景建模。在检测和语义分割中，使用背景模型是标准做法。该模型通常与感兴趣类别的模型具有相同的形式，但由负样本监督。在我们的实验中，我们采用了同样的方法，学习参数来为包括背景在内的所有类别打分。这真的有必要吗，还是只用各类别的模型就足够了？
为了进行考察，我们定义了一个带“空”背景模型的网络，其背景得分恒为零。我们不使用softmax损失进行训练（它通过跨类别归一化来引入竞争），而是使用sigmoid交叉熵损失，它独立地归一化每个得分。推理时，每个像素被分配得分最高的类别。在所有其他方面，该实验与我们在PASCAL VOC上的FCN-32s相同。空背景网络的得分比参考的FCN-32s以及一个对照FCN-32s（使用sigmoid交叉熵损失、在包括背景的所有类别上训练）低1个点。为了正确看待这一下降，请注意，以这种方式舍弃背景模型使参数总数减少不到0.1%。尽管如此，该结果表明，为语义分割学习一个专门的背景模型并非至关重要。
6.2 Momentum and Batch Size
In comparing optimization schemes for FCNs, we find that “heavy” online learning with high momentum trains more accurate models in less wall clock time (see Section 4.2). Here we detail a relationship between momentum and batch size that motivates heavy learning.
By writing the updates computed by gradient accumulation as a non-recursive sum, we will see that momentum and batch size can be approximately traded off, which suggests alternative training parameters. Let $g_t$ be the step taken by minibatch SGD with momentum at time $t$,

$$g_t = -\eta \sum_{i=0}^{k-1} \nabla_\theta \ell(x_{kt+i}; \theta_{t-1}) + p\, g_{t-1},$$

where $\ell(x;\theta)$ is the loss for example $x$ and parameters $\theta$, $p < 1$ is the momentum, $k$ is the batch size, and $\eta$ is the learning rate. Expanding this recurrence as an infinite sum with geometric coefficients, we have

$$g_t = -\eta \sum_{s=0}^{\infty} \sum_{i=0}^{k-1} p^s\, \nabla_\theta \ell(x_{k(t-s)+i}; \theta_{t-s-1}).$$

In other words, each example is included in the sum with coefficient $p^{\lfloor j/k \rfloor}$, where the index $j$ orders the examples from most recently considered to least recently considered. Approximating this expression by dropping the floor, we see that learning with momentum $p$ and batch size $k$ appears to be similar to learning with momentum $p'$ and batch size $k'$ if $p^{1/k} = p'^{1/k'}$. Note that this is not an exact equivalence: a smaller batch size results in more frequent weight updates, and may make more learning progress for the same number of gradient computations. For typical FCN values of momentum 0.9 and a batch size of 20 images, an approximately equivalent training regime uses momentum $0.9^{1/20} \approx 0.99$ and a batch size of one, resulting in online learning. In practice, we find that online learning works well and yields better FCN models in less wall clock time.
6.2动量和批量
在比较FCN的优化方案时，我们发现高动量的“重”在线学习能够在更少的挂钟时间内训练出更准确的模型（请参见第4.2节）。在这里，我们详细说明动量与批量大小之间的一个关系，它正是采用“重”学习的动机。
通过将梯度累积计算出的更新写成非递归的求和形式，我们会看到动量和批量大小可以近似地相互折换，这也提示了可供选择的其他训练参数。设 $g_t$ 为带动量的小批量SGD在时刻 $t$ 所走的一步：

$$g_t = -\eta \sum_{i=0}^{k-1} \nabla_\theta \ell(x_{kt+i}; \theta_{t-1}) + p\, g_{t-1},$$

其中 $\ell(x;\theta)$ 是样本 $x$ 在参数 $\theta$ 下的损失，$p < 1$ 是动量，$k$ 是批量大小，$\eta$ 是学习率。将该递推式展开为带几何系数的无穷级数，我们有：

$$g_t = -\eta \sum_{s=0}^{\infty} \sum_{i=0}^{k-1} p^s\, \nabla_\theta \ell(x_{k(t-s)+i}; \theta_{t-s-1}).$$

换句话说，每个样本以系数 $p^{\lfloor j/k \rfloor}$ 被包含在求和中，其中下标 $j$ 按从最近考虑到最早考虑的顺序对样本进行排序。去掉取整来近似该表达式后可以看到，若 $p^{1/k} = p'^{1/k'}$，则用动量 $p$、批量大小 $k$ 进行学习与用动量 $p'$、批量大小 $k'$ 进行学习近似相同。请注意，这并非精确等价：更小的批量大小会导致更频繁的权重更新，在相同数量的梯度计算下可能取得更多的学习进展。对于动量0.9、批量大小20张图像的典型FCN取值，一个近似等价的训练方案是使用动量 $0.9^{1/20} \approx 0.99$、批量大小为1，即在线学习。在实践中，我们发现在线学习效果良好，并能在更少的挂钟时间内得到更好的FCN模型。
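译者注：下面用NumPy对上述近似做一个简单的数值验证（比较两种方案中、按“新近程度”排序的样本在更新中所占的几何系数，代码为译者自拟的示意）：

```python
import numpy as np

j = np.arange(200)
coef_batch = 0.9 ** np.floor(j / 20)    # 动量0.9、批量大小20：系数 p^floor(j/k)
coef_online = (0.9 ** (1 / 20)) ** j    # 动量约0.995、批量大小1：系数 p'^j
# 两条系数曲线仅相差取整带来的误差（最大约0.1），整体衰减趋势一致
print(np.abs(coef_batch - coef_online).max())
```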
6.3 Upper Bounds on IU
FCNs achieve good performance on the mean IU segmentation metric even with spatially coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various resolutions. We do this by downsampling ground truth images and then upsampling back to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.
factor | mean IU |
---|---|
128 | 50.9 |
64 | 73.3 |
32 | 86.1 |
16 | 92.8 |
8 | 96.4 |
4 | 98.5 |
Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is a not a good measure of fine-scale accuracy. The gaps between oracle and state-of-the-art accuracy at every stride suggest that recognition and not resolution is the bottleneck for this metric.
6.3 IU的上限
即使语义预测在空间上很粗糙，FCN也能在平均IU分割指标上取得良好的表现。为了更好地理解这一指标以及本方法在该指标上的极限，我们用不同分辨率的预测来计算性能的近似上界。我们将ground truth图像下采样，然后再上采样回来，以模拟在特定下采样因子下可以获得的最佳结果。下表给出了各种下采样因子下，在PASCAL 2011 val一个子集上的平均IU。
像素级完美的预测显然不是达到远超当前最佳水平的平均IU所必需的；反过来说，平均IU也不是衡量精细尺度精度的好指标。每个步幅下这种“上界”与当前最佳精度之间的差距表明，识别能力而非分辨率才是该指标的瓶颈。
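译者注：下面用NumPy示意这种“先对ground truth下采样、再上采样回原分辨率”来估计平均IU上界的做法（译者自拟的简化实现，采用最近邻采样）：

```python
import numpy as np

def oracle_mean_iu(gt, factor, n_cl=21):
    """把ground truth按 factor 下采样（最近邻）、再上采样回去，
    与原图对比打分，以估计步幅为 factor 的预测所能达到的平均IU上界。"""
    coarse = gt[::factor, ::factor]
    recon = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    recon = recon[:gt.shape[0], :gt.shape[1]]
    hist = np.bincount(n_cl * gt.ravel() + recon.ravel(),
                       minlength=n_cl ** 2).reshape(n_cl, n_cl)
    n_ii, t_i, pred_i = np.diag(hist), hist.sum(1), hist.sum(0)
    return np.nanmean(n_ii / (t_i + pred_i - n_ii))

gt = np.random.randint(0, 21, (512, 512))   # 用随机标签代替真实的标注图
print(oracle_mean_iu(gt, 32))
```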
7 CONCLUSION
Fully convolutional networks are a rich class of models that address many pixelwise tasks. FCNs for semantic segmentation dramatically improve accuracy by transferring pretrained classifier weights, fusing different layer representations, and learning end-to-end on whole images. End-to-end, pixel-to-pixel operation simultaneously simplifies and speeds up learning and inference. All code for this paper is open source in Caffe, and all models are freely available in the Caffe Model Zoo. Further works have demonstrated the generality of fully convolutional networks for a variety of image-to-image tasks.
7 结论
全卷积网络是一类丰富的模型，可以处理许多像素级任务。用于语义分割的FCN通过迁移预训练的分类器权重、融合不同层的表示，以及在整幅图像上进行端到端学习，显著提高了准确性。端到端、像素到像素的操作同时简化并加速了学习和推理。本文的所有代码都在Caffe中开源，所有模型都可以在Caffe Model Zoo中免费获得。后续工作已经证明了全卷积网络对各种图像到图像任务的通用性。