【Paper Reading Notes】Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Summary
BN reduces internal covariate shift, i.e. the change in the distribution of the data flowing through each layer during training: even a small parameter update in one layer changes the distribution of its outputs, and the deeper the network, the more these changes are amplified. This forces us to be very careful with learning rates and network initialization, especially for deep networks. By reducing internal covariate shift, BN makes the network easier to initialize, allows much higher learning rates, speeds up convergence, alleviates the vanishing gradients caused by saturating nonlinearities, and acts as a regularizer (possibly removing the need for Dropout).
What problem does the paper solve?
Some definitions
covariate shift means the input distribution to a learning system changes
internal covariate shift means the distribution of each layer's inputs changes during training, which slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities
Problems
small changes to the network parameters amplify as the network becomes deeper
The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution
What method is used?
Batch Normalization: take a step towards reducing internal covariate shift, accelerating the training of deep neural nets by normalizing layer inputs
Batch Normalizing Transform (over a mini-batch $\mathcal{B}=\{x_1,\dots,x_m\}$, with learnable parameters $\gamma,\beta$):
$\mu_\mathcal{B}=\frac{1}{m}\sum_{i=1}^{m}x_i,\quad \sigma_\mathcal{B}^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_\mathcal{B})^2,\quad \hat{x}_i=\frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}},\quad y_i=\gamma\hat{x}_i+\beta\equiv \mathrm{BN}_{\gamma,\beta}(x_i)$
It operates independently on each feature of $x=(x^{(1)},\dots,x^{(d)})$.
Any layer that previously received $x$ as input now receives $\mathrm{BN}(x)$.
At inference time: $y=\frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\,x+\left(\beta-\frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right)$, using population statistics $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ in place of the mini-batch statistics.
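Below is a minimal NumPy sketch of the transform above. The function name `batch_norm`, the `momentum` parameter, and the moving-average tracking of the population statistics are implementation choices made here for illustration (the paper itself computes the population $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ after training); training mode uses mini-batch statistics, inference mode uses the stored estimates.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, eps=1e-5, momentum=0.1):
    """Batch Normalizing Transform for an (m, d) mini-batch, applied per feature."""
    if training:
        mu = x.mean(axis=0)                      # mini-batch mean  mu_B
        var = x.var(axis=0)                      # mini-batch variance  sigma_B^2
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
        # Running estimates of the population statistics, used at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: a fixed linear transform of x using the population estimates.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    y = gamma * x_hat + beta                     # scale and shift
    return y, running_mean, running_var

# Example usage: mini-batch of m = 4 examples with d = 3 features.
x = np.random.randn(4, 3)
y, rm, rv = batch_norm(x, np.ones(3), np.zeros(3), np.zeros(3), np.ones(3))
```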
How well does it work?
- 【Higher learning rate】allows us to use much higher learning rates
- 【Less bother when initializing】makes us less careful about initialization
- 【Call back saturating nonlinearities】makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes
- 【Faster training】applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin
- 【Some invariance】has a beneficial effect on the gradient flow through the network by reducing the dependence of gradients on the scale of the parameters or of their initial values, allowing us to use much higher learning rates without the risk of divergence. Back-propagation through a layer is unaffected by the scale of its parameters: for a scalar $a$, $\mathrm{BN}(Wu)=\mathrm{BN}((aW)u)$, and $\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u}=\frac{\partial\,\mathrm{BN}(Wu)}{\partial u}$, $\frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)}=\frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}$. The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth. (See the numerical check after this list.)
- 【Regularization】acts as a regularizer, in some cases eliminating the need for Dropout (or reducing its strength)
- 【In practice】using an ensemble of batch-normalized networks, the authors improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
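To illustrate the 【Some invariance】 point, here is a small numerical check (a toy sketch; the shapes, the scale factor `a`, and the helper `bn` are arbitrary choices made for the example): batch-normalizing $(aW)u$ gives, up to the effect of the small $\epsilon$, the same output as batch-normalizing $Wu$.

```python
import numpy as np

def bn(z, eps=1e-5):
    # Normalize each column over the mini-batch (gamma = 1, beta = 0 for simplicity).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))      # mini-batch of layer inputs
W = rng.normal(size=(10, 5))       # layer weights
a = 7.3                            # arbitrary positive scale factor

out1 = bn(u @ W)
out2 = bn(u @ (a * W))
print(np.allclose(out1, out2, atol=1e-4))  # True: BN((aW)u) matches BN(Wu)
```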
What are the shortcomings?
The paper does not seem to point out any shortcomings of its own.
Other notes
1. Benefits of mini-batches
- The loss over a mini-batch is a better estimate of the loss over the whole training set than the loss of a single example; the larger the batch size, the more accurate the estimate (a toy illustration follows this list)
- Training over a mini-batch is faster than the same number of single-example passes because the computation can be parallelized
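A toy illustration of the first point (the dataset, the quadratic loss, and the batch sizes here are made up for the example): the mini-batch loss is an estimator of the full-training-set loss whose spread shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)   # toy "training set"
w = 0.0                                              # toy parameter
loss = lambda batch: np.mean((batch - w) ** 2)       # loss estimated on a batch

print(f"full-training-set loss: {loss(data):.3f}")
for m in (1, 16, 256):
    estimates = [loss(rng.choice(data, size=m)) for _ in range(1000)]
    print(f"batch size {m:3d}: std of loss estimate = {np.std(estimates):.3f}")
# The standard deviation shrinks as m grows, i.e. the mini-batch loss
# becomes a better estimate of the loss over the whole training set.
```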
2. In theory, BN does not make the network worse
When $\gamma^{(k)}=\sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)}=\mathrm{E}[x^{(k)}]$, the BN transform is an identity transform, so (in theory) the network is at least no worse than without it. A quick numerical check follows.
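A minimal check of this claim (a sketch; it uses $\gamma=\sqrt{\mathrm{Var}[x]+\epsilon}$ rather than $\sqrt{\mathrm{Var}[x]}$ so that the small $\epsilon$ cancels exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))       # mini-batch of activations
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choosing gamma = sqrt(var + eps) and beta = mu inverts the normalization,
# so the BN transform can represent the identity.
gamma, beta = np.sqrt(var + eps), mu
y = gamma * x_hat + beta
print(np.allclose(y, x))  # True
```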
3. BN in convolutional networks
In convolutional networks, to stay consistent with the convolutional property, one pair of $\gamma$ and $\beta$ is learned per feature map rather than per activation as before; if the feature map is of size $p \times q$ and the mini-batch contains $m$ examples, the effective mini-batch size is $m'=mpq$ (see the sketch below).
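A minimal sketch of the convolutional case (assuming an NCHW layout, which is an implementation choice, not something the paper fixes): statistics are computed over the batch and both spatial dimensions, i.e. over $m'=mpq$ values per feature map, with one $\gamma,\beta$ pair per channel.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (m, C, p, q); gamma, beta: (C,). One (gamma, beta) pair per feature map."""
    # Mean/variance over the batch and spatial axes -> m' = m * p * q values per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.randn(8, 16, 28, 28)           # m = 8, C = 16, p = q = 28
y = batch_norm_conv(x, np.ones(16), np.zeros(16))
```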
4. Where to place the BN layer
The paper discusses, for a layer $z=g(Wu+b)$,
whether to apply BN to the input $u$ or to the pre-activation $Wu+b$.
Since $u$ is usually the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift; $Wu+b$, in contrast, is more likely to have a symmetric, non-sparse, "more Gaussian" distribution (a reference is cited in the paper at this point), so normalizing it is more likely to produce activations with a stable distribution. A sketch of the resulting ordering follows.
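A sketch of the resulting layer ordering (toy shapes; `relu` stands in for the nonlinearity $g$, and the bias $b$ is dropped because the subsequent mean subtraction cancels it and $\beta$ takes over its role, as noted in the paper):

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-5):
    mu, var = z.mean(axis=0), z.var(axis=0)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(2)
u = rng.normal(size=(32, 10))      # layer input
W = rng.normal(size=(10, 5))       # layer weights
gamma, beta = np.ones(5), np.zeros(5)

# Original layer:      z = g(Wu + b)
# Batch-normalized:    z = g(BN(Wu))   -- normalize Wu, not u
z = relu(bn(u @ W, gamma, beta))
```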
Future Work
- Apply BN to recurrent neural networks (RNNs)
- Investigate whether BN can help with domain adaptation: whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances
Questions
1. Why isn't the second term just written as 0?
In the paper,
$\frac{\partial\ell}{\partial\mu_\mathcal{B}}=\left(\sum_{i=1}^{m}\frac{\partial\ell}{\partial\hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}\right)+\frac{\partial\ell}{\partial\sigma_\mathcal{B}^2}\cdot\frac{\sum_{i=1}^{m}-2(x_i-\mu_\mathcal{B})}{m}$
but since $\mu_\mathcal{B}=\frac{1}{m}\sum_{i=1}^{m}x_i$, we have $\sum_{i=1}^{m}(x_i-\mu_\mathcal{B})=0$,
so the second term is 0, and an actual computation confirms this. Why, then, does the paper keep it in the formula? (A numerical check follows.)
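A numerical check of this observation (a toy sketch for a single feature; `dl_dxhat` plays the role of the upstream gradient $\partial\ell/\partial\hat{x}_i$). It only verifies that $\sum_i(x_i-\mu_\mathcal{B})=0$ makes the second term vanish, so both versions of $\partial\ell/\partial\mu_\mathcal{B}$ agree.

```python
import numpy as np

rng = np.random.default_rng(3)
m, eps = 64, 1e-5
x = rng.normal(size=m)                     # one feature over a mini-batch
dl_dxhat = rng.normal(size=m)              # random upstream gradient w.r.t. x_hat

mu, var = x.mean(), x.var()
dl_dvar = np.sum(dl_dxhat * (x - mu)) * (-0.5) * (var + eps) ** (-1.5)

# The paper's formula, with the second term included:
dl_dmu_full = np.sum(dl_dxhat * (-1.0 / np.sqrt(var + eps))) \
              + dl_dvar * np.sum(-2.0 * (x - mu)) / m
# The same gradient with the second term dropped:
dl_dmu_short = np.sum(dl_dxhat * (-1.0 / np.sqrt(var + eps)))

print(np.sum(x - mu))                          # ~0 (up to floating-point error)
print(np.isclose(dl_dmu_full, dl_dmu_short))   # True
```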
2. Some vocabulary
- inference: probably refers to the test phase of the network
- population: probably refers to the whole training set