Original paper: https://arxiv.org/pdf/1409.1556.pdf

1. Abstract

Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results.

2. Introduction

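Convolutional networks have advanced rapidly thanks to large public image datasets, much greater GPU computing power, and the ILSVRC competition. Since the deep ConvNet of Krizhevsky et al. in 2012, many improved variants have appeared. For example, Zeiler & Fergus used a smaller receptive window and a smaller stride in the first convolutional layer, and Sermanet et al. trained and tested their networks densely over the whole image and over multiple scales.
In this paper, we address another important aspect of ConvNet architecture design – its depth. We fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.
The proposed architecture also applies to other image recognition datasets, where it performs well even as part of a relatively simple pipeline (for example, deep features classified by a linear SVM without fine-tuning).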
The rest of the paper is organised as follows (section numbers are the paper's own):
Sect. 2: describes the ConvNet configurations.
Sect. 3: details the image classification training and evaluation.
Sect. 4: compares the configurations on the ILSVRC classification task.

3. ConvNet Configurations

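To measure the gain from depth fairly, all ConvNet layer configurations follow the same design principles. The subsections below describe the generic layout, the specific configurations, and how the design compares with earlier ConvNets.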

3.1 Architecture (generic design)

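1) The input is a fixed-size 224×224 RGB image. The only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel.
2) The image passes through a stack of convolutional layers with 3×3 filters (the smallest size that captures the notion of left/right, up/down, and centre). One configuration also uses 1×1 filters, which can be seen as a linear transformation of the input channels. The convolution stride is 1 pixel, and the spatial padding is chosen so that the spatial resolution is preserved after convolution, i.e. 1 pixel for 3×3 conv. layers.
3) Some of the conv. layers (not all) are followed by a max-pooling layer with a 2×2 window and stride 2.
4) The stack built from items 2 and 3 (arranged differently and to different depths in each configuration) is followed by three fully connected layers: the first two have 4096 channels each, and the third has 1000 channels, one per ILSVRC class. The final layer is a soft-max layer.
5) All hidden layers use the ReLU non-linearity. The paper also reports that Local Response Normalisation (LRN; Krizhevsky et al., 2012) does not help.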
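As a rough check of the layout above, the sketch below tracks the spatial size of a feature map through 3×3 convolutions (stride 1, padding 1) and 2×2 max-pooling (stride 2). The function names are illustrative, and the count of five pooling stages is an assumption taken from the paper's configuration table rather than from these notes.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output size of max-pooling along one spatial dimension."""
    return (size - window) // stride + 1


size = 224
# A 3x3 convolution with stride 1 and padding 1 keeps the resolution.
assert conv_out(size) == size

sizes = [size]
for _ in range(5):  # assumption: five pooling stages, as in the paper's table
    size = pool_out(conv_out(size))
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

Only the pooling layers change the resolution, so under these assumptions the first fully connected layer would see a 7×7 feature map.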

3.2 Configurations

The ConvNet configurations evaluated in the paper are listed in the following table:
[Table 1: ConvNet configurations, from "Very Deep Convolutional Networks for Large-Scale Image Recognition"]

All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth. The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
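The width rule quoted above can be written down in a few lines. This is only a sketch: `conv_widths` is a made-up helper, and the default of five stages (one per max-pooling layer) is an assumption taken from the paper's table.

```python
def conv_widths(first=64, cap=512, num_stages=5):
    """Channels per stage: start at `first`, double after each pooling, stop at `cap`."""
    widths = []
    width = first
    for _ in range(num_stages):
        widths.append(width)
        width = min(width * 2, cap)
    return widths


print(conv_widths())  # [64, 128, 256, 512, 512]
```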

3.3 Discussion

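The networks use very small 3×3 filters with a stride of 1 pixel throughout. A stack of two 3×3 conv. layers has an effective receptive field of 5×5, and a stack of three has 7×7, so such stacks can replace a single 5×5 or 7×7 layer. The advantages: First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters. Why do three 3×3 layers need fewer parameters? Suppose both the input and the output of a three-layer 3×3 convolution stack have C channels. The stack then has $3(3^2C^2) = 27C^2$ weights, whereas a single 7×7 conv. layer needs $7^2C^2 = 49C^2$.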
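The receptive-field and parameter arithmetic above can be checked with a short sketch. The helper names and the channel count C = 256 are hypothetical, chosen only for illustration; biases are ignored, as in the count above.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convolutions with no pooling in between."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weights (no biases) of `num_layers` k x k conv layers with C input and C output channels."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel count

assert stacked_receptive_field(2) == 5  # two 3x3 layers cover 5x5
assert stacked_receptive_field(3) == 7  # three 3x3 layers cover 7x7

three_3x3 = conv_weights(3, C, num_layers=3)  # 27 * C^2
one_7x7 = conv_weights(7, C)                  # 49 * C^2
print(three_3x3, one_7x7)  # 1769472 3211264
```

The ratio 49/27 is independent of C, so the single 7×7 layer always needs about 81% more weights than the three-layer 3×3 stack.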
The incorporation of 1×1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.
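To illustrate why a 1×1 convolution leaves the receptive field untouched, the sketch below applies it to a single pixel: it is a linear map over that pixel's channels followed by ReLU, with no neighbouring pixels involved. The function name and the toy weights are made up for the example.

```python
def conv1x1_relu(pixel, weights, biases):
    """Apply a 1x1 convolution plus ReLU to one pixel.

    pixel:   list of C_in channel values at one spatial location
    weights: C_out rows, each a list of C_in weights
    biases:  list of C_out biases
    """
    out = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, pixel)) + bias
        out.append(max(0.0, z))  # ReLU adds the non-linearity
    return out


pixel = [1.0, -2.0]
weights = [[1.0, 1.0], [0.5, -0.5]]
biases = [0.0, 0.0]
print(conv1x1_relu(pixel, weights, biases))  # [0.0, 1.5]
```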
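The paper then contrasts its design with earlier work. Ciresan et al. (2011) used small filters, but their networks were much shallower and were not evaluated on a large-scale dataset. GoogLeNet is also a very deep ConvNet built from small filters (3×3, together with 1×1 and 5×5), but its topology is more complex, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation.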

4. Classification Framework

4.1 Training

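The training procedure generally follows Krizhevsky et al. (2012). Training optimises the multinomial logistic regression objective with mini-batch gradient descent. Dropout with ratio 0.5 is applied to the first two fully connected layers. The learning rate starts at $10^{-2}$ and is divided by 10 whenever the validation accuracy stops improving. Although the networks have more parameters and are deeper, they converge in fewer epochs than the network of Krizhevsky et al. (2012).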
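The learning-rate rule above might be sketched as follows. The exact test for "stops improving" is not given in these notes, so the criterion used here (the latest validation accuracy does not beat the best earlier one) is an assumption, and the accuracy values are made up.

```python
def step_learning_rate(lr, val_acc_history, factor=10.0):
    """Divide `lr` by `factor` if the latest validation accuracy did not improve.

    Assumption: "did not improve" means not higher than the best earlier value.
    """
    if len(val_acc_history) >= 2 and val_acc_history[-1] <= max(val_acc_history[:-1]):
        return lr / factor
    return lr


lr = 1e-2  # initial learning rate from the notes
history = []
for acc in [0.50, 0.60, 0.60, 0.65]:  # made-up validation accuracies
    history.append(acc)
    lr = step_learning_rate(lr, history)

# The plateau at the third evaluation triggers exactly one division by 10.
```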
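Weight initialisation: the shallowest configuration (A) is trained from random initialisation. For the deeper configurations, the first four convolutional layers and the last three fully connected layers are initialised with the layers of net A, and the intermediate layers are initialised randomly.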
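Choosing the training scale S (the smallest side of the rescaled training image):
Option 1: a fixed scale, with S = 256 or S = 384.
Option 2: multi-scale training, with S sampled from [256, 512]. For speed, these models are obtained by fine-tuning the model trained with S = 384.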
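The multi-scale option can be sketched as: sample S, rescale the image so that its smallest side equals S, then take a random 224×224 crop. The helpers below only compute sizes and crop boxes. Their names, the 500×375 example image, and the uniform integer sampling of S are assumptions made for illustration.

```python
import random


def sample_train_scale(s_min=256, s_max=512, rng=random):
    """Pick a training scale S from [s_min, s_max] (assumed uniform over integers)."""
    return rng.randint(s_min, s_max)


def rescaled_size(width, height, s):
    """Isotropically rescale so that the smallest side equals s."""
    scale = s / min(width, height)
    return round(width * scale), round(height * scale)


def random_crop_box(width, height, crop=224, rng=random):
    """Return (left, top, right, bottom) of a random crop inside the image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


rng = random.Random(0)               # seeded only to make the example repeatable
s = sample_train_scale(rng=rng)
w, h = rescaled_size(500, 375, s)    # hypothetical 500x375 source image
box = random_crop_box(w, h, rng=rng)
```

Because S is never smaller than 256, the rescaled image is always large enough to contain a 224×224 crop.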

4.2 Testing

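Test procedure: (1) rescale the image so that its smallest side is Q; (2) convert the fully connected layers to convolutional layers and apply the resulting fully convolutional net to the whole image; (3) spatially average (sum-pool) the class score map into a fixed-size vector of class scores.
The test scale Q need not equal the training scale S, and using several values of Q for each S gives better results.
Because the network is applied densely to the whole image, multiple crops are not needed at test time. Multi-crop evaluation can improve accuracy and is reported for reference, but in practice its extra computation is hard to justify.
Implementation details: the implementation is derived from the C++ Caffe toolbox and uses multiple GPUs in parallel. On 4 Titan GPUs, training is 3.75 times faster than on a single GPU, and each network takes about 2-3 weeks to train.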
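The last step of the test procedure, spatial averaging of the class score map, can be illustrated with plain nested lists. The layout `score_map[class][y][x]` and the toy 2×2 maps are assumptions for the example; a real implementation would operate on tensors.

```python
def average_score_map(score_map):
    """Collapse a class score map into one score per class by spatial averaging.

    score_map[c][y][x] is the score of class c at spatial position (y, x).
    """
    scores = []
    for plane in score_map:
        total = sum(sum(row) for row in plane)
        count = sum(len(row) for row in plane)
        scores.append(total / count)
    return scores


toy_map = [
    [[1.0, 3.0], [2.0, 2.0]],  # class 0
    [[0.0, 0.0], [4.0, 0.0]],  # class 1
]
print(average_score_map(toy_map))  # [2.0, 1.0]
```

Since the averaging works for any map size, the same network can score test images of any scale Q.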

5. Classification experiments

5.1 Single scale evaluation

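Each network is evaluated at a single test scale: Q = S for fixed S, and Q = 0.5(Smin + Smax) = 384 for jittered S ∈ [256, 512]. First, local response normalisation (the A-LRN network) does not improve on model A, which has no normalisation layers. Normalisation is therefore not used in the deeper architectures.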

Augmenting the training set by scale jittering is helpful for capturing multi-scale image statistics.

5.2 Multi-scale evaluation

For fixed S: Q = {S − 32, S, S + 32}. For jittered S ∈ [Smin, Smax]: Q = {Smin, 0.5(Smin + Smax), Smax}.
The deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S.

5.3 Multi-crop evaluation

Dense ConvNet evaluation vs. multi-crop evaluation
Multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
Using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them.

5.4 ConvNet fusion

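The outputs of several models are combined by averaging their results.
By the time of the ILSVRC submission, the authors had trained only the single-scale networks and a multi-scale model D (obtained by fine-tuning only the fully connected layers). The resulting ensemble of 7 networks has an ILSVRC test error of 7.3%.
After the submission, an ensemble of only the two best-performing multi-scale models (configurations D and E) reduced the test error to 7.0% with dense evaluation and to 6.8% with combined dense and multi-crop evaluation. For comparison, the best single model achieves 7.1% error.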
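The averaging used for fusion can be sketched as an element-wise mean over the models' class probability vectors. The function name and the toy posteriors are made up, and treating the averaged quantities as soft-max posteriors follows the paper rather than these notes.

```python
def fuse_posteriors(posteriors):
    """Average class posteriors from several models, class by class.

    posteriors: list of per-model probability vectors of equal length.
    """
    num_models = len(posteriors)
    num_classes = len(posteriors[0])
    return [
        sum(p[c] for p in posteriors) / num_models
        for c in range(num_classes)
    ]


model_a = [0.5, 0.25, 0.25]  # made-up posteriors over three classes
model_b = [0.25, 0.5, 0.25]
print(fuse_posteriors([model_a, model_b]))  # [0.375, 0.375, 0.25]
```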

5.5 Comparison with the state of the art

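The very deep ConvNets outperform all previous best models and are competitive with GoogLeNet, the winner of the classification task. Among single-network results, this architecture achieves the best performance.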

6. Conclusion

It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth.
The paper's appendices also cover localisation experiments and the generalisation of very deep features to other datasets.
