Fully Convolutional Networks for Semantic Segmentation 全文翻译

论文题目:Fully Convolutional Networks for Semantic Segmentation
论文来源:Fully Convolutional Networks for Semantic Segmentation_2015_CVPR
翻译人:[email protected]实验室

Fully Convolutional Networks for Semantic Segmentation

Jonathan Long∗ Evan Shelhamer∗ Trevor Darrell
UC Berkeley
{jonlong,shelhamer,trevor}@cs.berkeley.edu

用于语义分割的全卷积网络

Jonathan Long∗ Evan Shelhamer∗ Trevor Darrell
UC Berkeley
{jonlong,shelhamer,trevor}@cs.berkeley.edu

Abstract

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

摘要

卷积网络是可以产生特征层次结构的强大视觉模型。我们表明，卷积网络本身经过端到端、像素到像素的训练，在语义分割方面超过了最先进的水平。我们的关键见解是构建“全卷积”网络，该网络可接受任意大小的输入，并通过高效的推理和学习产生相应大小的输出。我们定义并详述了全卷积网络的空间，解释了它们在空间密集预测任务中的应用，并阐述了与先前模型的联系。我们将当代分类网络（AlexNet [20]、VGG网络[31]和GoogLeNet [32]）改造成全卷积网络，并通过微调[3]将它们学习到的表示迁移到分割任务中。然后，我们定义了一种跳跃结构，它将来自深层、粗糙层的语义信息与来自浅层、精细层的外观信息相结合，以生成准确而细致的分割。我们的全卷积网络在PASCAL VOC（在2012年数据上达到62.2%的平均IU，相对提升20%）、NYUDv2和SIFT Flow上实现了最先进的分割，而对一幅典型图像的推理时间不到五分之一秒。

1. 解决了输入大小尺寸限制问题
2. 将经典分类网络改造成全卷积网络，开创了语义分割的先河，实现了端到端（end-to-end）、像素级别的分类预测
3. 实现了对PASCAL VOC、NYUDv2和SIFT Flow的最先进的分割
4. 主要技术贡献(卷积化、跳跃连接、反卷积)
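下面用一个极简的numpy草图说明“卷积化”的核心思想（假设性示例，非论文原始实现）：把全连接层的权重重排成卷积核后，对原始输入尺寸两者输出完全一致；而对更大的输入，卷积形式会自然地产生一个粗糙的空间得分图，而不是报尺寸错误。

```python
import numpy as np

def conv2d_valid(x, w):
    """朴素的'valid'二维卷积（实为互相关，与卷积网络的惯例一致）。
    x: (C_in, H, W), w: (C_out, C_in, kH, kW) -> (C_out, H-kH+1, W-kW+1)。"""
    c_out, c_in, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[o])
    return out

rng = np.random.default_rng(0)

# 假想的“分类头”：把 2x7x7 的特征图展平后接一个输出为10的全连接层。
feat = rng.standard_normal((2, 7, 7))
W_fc = rng.standard_normal((10, 2 * 7 * 7))
fc_scores = W_fc @ feat.reshape(-1)        # 形状 (10,)

# 同一组权重重排成 7x7 卷积核（即“卷积化”）。
W_conv = W_fc.reshape(10, 2, 7, 7)
conv_scores = conv2d_valid(feat, W_conv)   # 形状 (10, 1, 1)
assert np.allclose(fc_scores, conv_scores[:, 0, 0])

# 对更大的输入，卷积形式产生粗糙的空间得分图。
big_feat = rng.standard_normal((2, 10, 10))
score_map = conv2d_valid(big_feat, W_conv)
print(score_map.shape)  # (10, 4, 4)：每个输出位置一组10类得分
```

这正是“可接受任意大小输入”这一性质的来源：权重不变，只是解释方式从向量内积变为滑动卷积。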

1.Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [20, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 10, 17], part and keypoint prediction [39, 24], and local correspondence [24, 8].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 7, 28, 15, 13, 9], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.
We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 7, 28, 9], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [7, 15], proposals [15, 13], or post-hoc refinement by random fields or local classifiers [7, 15]. Our model transfers recent success in classification [20, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [7, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid. We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

1.引言

卷积网络正在推动识别技术的进步。卷积网络不仅改善了整体图像分类[20,31,32]，而且在具有结构化输出的局部任务上也取得了进展。这些进展包括边界框目标检测[29,10,17]、部件和关键点预测[39,24]，以及局部对应[24,8]。

从粗略推断到精细推理，自然的下一步是对每个像素进行预测。先前的方法已经将卷积网络用于语义分割[27,2,7,28,15,13,9]，其中每个像素都用其所属对象或区域的类别来标记，但存在本文工作所要解决的缺点。
图中的21代表PASCAL VOC数据集的类别数：共20个类别，加背景为21类。

我们表明，在语义分割上经过端到端、像素到像素训练的全卷积网络（FCN）无需额外机制即可超越最新技术。据我们所知，这是第一个端到端训练FCN的工作：(1)用于像素级预测，且(2)基于有监督的预训练。现有网络的全卷积版本可以从任意大小的输入预测密集输出。学习和推理都通过密集的前馈计算和反向传播在整幅图像上一次完成。网络内的上采样层使带有下采样池化的网络能够进行像素级的预测和学习。

这种方法在渐近意义和绝对意义上都是高效的，并且不需要其他工作中的那些复杂机制。逐块训练很常见[27,2,7,28,9]，但缺乏全卷积训练的效率。我们的方法没有使用复杂的预处理和后处理，包括超像素[7,15]、候选区域[15,13]，或通过随机场或局部分类器进行事后细化[7,15]。我们的模型把分类网络重新解释为全卷积网络并从其学习到的表示进行微调，从而将最近在分类任务[20,31,32]中取得的成功转移到密集预测任务。相比之下，以前的工作是在没有有监督预训练的情况下应用小型卷积网络[7,28,27]。

语义分割面临着语义和位置之间的固有矛盾：全局信息解决“是什么”，而局部信息解决“在哪里”。深度特征层次结构以非线性的局部到全局金字塔形式对位置和语义进行编码。在4.2节中，我们定义了一种跳跃结构来利用这一特征谱，它结合了深层、粗略的语义信息和浅层、精细的外观信息（见图3）。
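跳跃结构的组合方式可以用如下纯numpy草图示意（假设性示例：为简化起见，用最近邻上采样代替论文中可学习的反卷积）：把深层粗糙的得分图上采样到浅层的步长后逐元素相加，再沿类别维取argmax得到逐像素标签。

```python
import numpy as np

def upsample_nearest(x, factor):
    """对 (C, H, W) 得分图做整数倍最近邻上采样
    （论文实际学习一个以双线性插值初始化的反卷积层）。"""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(1)
n_classes = 21  # PASCAL VOC：20类 + 背景

# 假想的逐类得分图：同一幅图像的深层粗糙预测（步长更大）
# 与浅层精细预测（分辨率更高）。
coarse = rng.standard_normal((n_classes, 8, 8))   # 语义信息，低分辨率
fine = rng.standard_normal((n_classes, 16, 16))   # 外观信息，高分辨率

# 跳跃融合：把粗糙预测上采样到精细层的步长后求和。
fused = upsample_nearest(coarse, 2) + fine
assert fused.shape == (n_classes, 16, 16)

# 逐像素标签 = 沿类别维取argmax。
labels = fused.argmax(axis=0)
print(labels.shape)  # (16, 16)
```

求和融合使深层的“是什么”与浅层的“在哪里”在同一分辨率下结合，这正是上文所述矛盾的折中方案。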

在下一节中，我们将回顾有关深度分类网络、FCN以及使用卷积网络进行语义分割的近期方法的相关工作。随后各节解释FCN设计和密集预测的权衡，介绍我们带有网络内上采样和多层组合的架构，并描述我们的实验框架。最后，我们展示了在PASCAL VOC 2011-2、NYUDv2和SIFT Flow上的最新结果。

本节介绍了卷积网络的进展，以及如何过渡到逐像素预测的想法；随后介绍了FCN的主要贡献，包括可接受任意大小的输入、skip结构等，以及最后的实验结果。

2.Related work

Our approach draws on recent successes of deep nets for image classification [20, 31, 32] and transfer learning [3, 38]. Transfer was first demonstrated on various visual recognition tasks [3, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [10, 15, 13]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [26], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [4] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the nonconvolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al.[7], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid convnet/nearest neighbor model by Ganin and Lempitsky [9]; and image restoration and depth estimation by Eigen et al. [4, 5]. Common elements of these approaches include
• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 7, 28, 9];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [7, 2, 9];
• input shifting and output interlacing for dense output [29, 28, 9];
• multi-scale pyramid processing [7, 28, 9];
• saturating tanh nonlinearities [7, 4, 28]; and
• ensembles [2, 9],
whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [5] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [15] and Gupta et al. [13] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [10] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end. They achieve state-of-the-art segmentation results on PASCAL VOC and NYUDv2 respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

We fuse features across layers to define a nonlinear local-to-global representation that we tune end-to-end. In contemporary work Hariharan et al. [16] also use multiple layers in their hybrid model for semantic segmentation.

2.相关工作

我们的方法借鉴了深度网络最近在图像分类[20, 31, 32]和迁移学习[3, 38]上的成功。迁移最初在各种视觉识别任务[3, 38]上得到演示，随后用于检测，以及混合提议-分类器模型中的实例分割和语义分割[10, 15, 13]。我们现在重新构建并微调分类网络，以直接、密集地预测语义分割。我们描绘了FCN的空间，并将历史上和近期的先前模型置于该框架中。

全卷积网络 据我们所知，将卷积网络扩展到任意大小输入的想法最早出现在Matan等人[26]的工作中，他们扩展了经典的LeNet [21]来识别数字串。因为他们的网络被限制为一维输入字符串，所以Matan等人使用Viterbi解码来获得输出。Wolf和Platt [37]将卷积网络的输出扩展为邮政地址块四个角的检测分数二维图。这两项历史工作都以全卷积方式进行推理和学习，用于检测。Ning等人[27]定义了一个卷积网络，通过全卷积推理对秀丽隐杆线虫组织进行粗略的多类别分割。

在当前的多层网络时代，全卷积计算也已经得到了利用。Sermanet等人的滑动窗口检测[29]、Pinheiro和Collobert的语义分割[28]以及Eigen等人的图像恢复[4]都使用了全卷积推理。全卷积训练很少见，但Tompson等人[35]有效地使用了它来学习端到端的部件检测器和用于姿态估计的空间模型，尽管他们没有对这一方法进行阐述或分析。

另外，He等人[17]丢弃分类网络的非卷积部分来构造特征提取器。他们结合候选区域和空间金字塔池化，产生一个局部化的、固定长度的特征用于分类。虽然快速有效，但这种混合模型无法端到端地学习。

用卷积网络进行密集预测 最近的一些研究已经将卷积网络应用于密集预测问题，包括Ning等人[27]、Farabet等人[7]以及Pinheiro和Collobert [28]的语义分割；Ciresan等人[2]的电子显微镜边界预测，以及Ganin和Lempitsky [9]用混合卷积网络/最近邻模型进行的自然图像边界预测；还有Eigen等人[4, 5]的图像恢复和深度估计。这些方法的共同要素包括：
• 限制容量和感受野的小模型;
• 逐块训练[27, 2, 7, 28, 9];
• 通过超像素投影、随机场正则化、滤波或局部分类进行后处理[7,2,9]；
• 密集输出的输入移位和输出交错[29,28,9];
• 多尺度金字塔处理[7,28,9];
• 饱和tanh非线性[7,4,28];
• 集成[2,9]
而我们的方法没有这些机制。然而，我们从FCN的角度研究了逐块训练（3.4节）和“移位-拼接”（shift-and-stitch）密集输出（3.2节）。我们还讨论了网络内上采样（3.3节），其中Eigen等人[5]的全连接预测是一个特例。
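网络内上采样通常实现为带步长的转置卷积（反卷积）。下面是一个假设性的numpy草图，展示FCN类工作中常用的双线性插值核，以及用它做步长为2的转置卷积上采样（非论文原始代码，仅示意原理）：

```python
import numpy as np

def bilinear_kernel(size):
    """(size x size) 双线性插值滤波器的权重——
    反卷积上采样层的常见初始化方式。"""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

def upsample2x(x, k):
    """对单通道图 x 做步长为2的转置卷积，使用核 k（这里是 4x4 双线性核），
    并裁剪到输入尺寸的2倍。"""
    h, w = x.shape
    s, ks = 2, k.shape[0]
    out = np.zeros((h * s + ks - s, w * s + ks - s))
    for i in range(h):
        for j in range(w):
            # 每个输入像素把核的一份加权拷贝“铺”到输出上
            out[i * s:i * s + ks, j * s:j * s + ks] += x[i, j] * k
    crop = (ks - s) // 2
    return out[crop:crop + h * s, crop:crop + w * s]

k = bilinear_kernel(4)
x = np.arange(9, dtype=float).reshape(3, 3)
y = upsample2x(x, k)
print(y.shape)  # (6, 6)
```

这样的层把粗糙的得分图恢复到更高分辨率；若将核设为可学习参数，它就成为网络内端到端训练的一部分。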

与这些现有方法不同，我们调整并扩展了深度分类架构，使用图像分类作为有监督的预训练，并通过全卷积微调，以简单高效地从整幅图像输入和整幅图像的Ground Truth中学习。

Hariharan等人[15]和Gupta等人[13]同样使深度分类网络适应语义分割，但是在混合提议-分类器模型中这样做的。这些方法通过对边界框和/或候选区域采样来微调一个R-CNN系统[10]，用于检测、语义分割和实例分割。这两种方法都不是端到端学习的。它们分别在PASCAL VOC和NYUDv2上取得了最先进的分割结果，因此我们在第5节中直接将我们独立的端到端FCN与它们的语义分割结果进行比较。

我们融合各层的特征，定义一个端到端调整的非线性局部到全局表示。在同期工作中，Hariharan等人[16]也在其混合模型中使用了多层特征来进行语义分割。

3.Fully convolutional networks