【Paper Reading Notes】Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Summary
BN reduces internal covariate shift, i.e. the change in the distribution of the data flowing through each layer during training: even a small parameter update in one layer changes the distribution of its outputs, and the deeper the network, the more these changes are amplified. This forces us to be very careful with learning rates and network initialization, especially for deep networks. By reducing internal covariate shift, BN makes the network easier to initialize, allows much higher learning rates, speeds up convergence, alleviates the vanishing gradients caused by saturating nonlinearities, and acts as a regularizer (possibly removing the need for Dropout).
What problem does the paper solve?
Some definitions
covariate shift means the input distribution to a learning system changes
internal covariate shift means the distribution of each layer's inputs changes during training, which slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities
Problems
small changes to the network parameters amplify as the network becomes deeper
The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution
What method is used?
Batch Normalization: take a step towards reducing internal covariate shift, accelerating the training of deep neural nets by normalizing layer inputs
Batch Normalizing Transform (over a mini-batch $\mathcal{B}=\{x_1,\dots,x_m\}$, with learnable parameters $\gamma,\beta$):
$\mu_\mathcal{B}=\frac{1}{m}\sum_{i=1}^{m}x_i,\quad \sigma_\mathcal{B}^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_\mathcal{B})^2,\quad \hat{x}_i=\frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}},\quad y_i=\gamma\hat{x}_i+\beta\equiv \mathrm{BN}_{\gamma,\beta}(x_i)$
It operates independently on each feature of $x=(x^{(1)},\dots,x^{(d)})$.
Any layer that previously received $x$ as input now receives $\mathrm{BN}(x)$.
At inference time: $y=\frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\,x+\left(\beta-\frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right)$, using population statistics $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ in place of the mini-batch statistics.
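Below is a minimal NumPy sketch of the transform above. The function name `batch_norm`, the `momentum` parameter, and the moving-average tracking of the population statistics are implementation choices made here for illustration (the paper itself computes the population $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ after training); training mode uses mini-batch statistics, inference mode uses the stored estimates.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, eps=1e-5, momentum=0.1):
    """Batch Normalizing Transform for an (m, d) mini-batch, applied per feature."""
    if training:
        mu = x.mean(axis=0)                      # mini-batch mean  mu_B
        var = x.var(axis=0)                      # mini-batch variance  sigma_B^2
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
        # Running estimates of the population statistics, used at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: a fixed linear transform of x using the population estimates.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    y = gamma * x_hat + beta                     # scale and shift
    return y, running_mean, running_var

# Example usage: mini-batch of m = 4 examples with d = 3 features.
x = np.random.randn(4, 3)
y, rm, rv = batch_norm(x, np.ones(3), np.zeros(3), np.zeros(3), np.ones(3))
```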
How well does it work?
- 【Higher learning rate】allows us to use much higher learning rates
- 【Less bother when initializing】makes us less careful about initialization
- 【Call back saturating nonlinearities】makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes
- 【Faster training】applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin
- 【Some invariance】has a beneficial effect on the gradient flow through the network by reducing the dependence of gradients on the scale of the parameters or of their initial values, allowing us to use much higher learning rates without the risk of divergence. Back-propagation through a layer is unaffected by the scale of its parameters: for a scalar $a$, $\mathrm{BN}(Wu)=\mathrm{BN}((aW)u)$, and $\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u}=\frac{\partial\,\mathrm{BN}(Wu)}{\partial u}$, $\frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)}=\frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}$. The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth. (See the numerical check after this list.)
- 【Regularization】acts as a regularizer, in some cases eliminating the need for Dropout (or reducing its strength)
- 【In practice】using an ensemble of batch-normalized networks, the authors improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
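To illustrate the 【Some invariance】 point, here is a small numerical check (a toy sketch; the shapes, the scale factor `a`, and the helper `bn` are arbitrary choices made for the example): batch-normalizing $(aW)u$ gives, up to the effect of the small $\epsilon$, the same output as batch-normalizing $Wu$.

```python
import numpy as np

def bn(z, eps=1e-5):
    # Normalize each column over the mini-batch (gamma = 1, beta = 0 for simplicity).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))      # mini-batch of layer inputs
W = rng.normal(size=(10, 5))       # layer weights
a = 7.3                            # arbitrary positive scale factor

out1 = bn(u @ W)
out2 = bn(u @ (a * W))
print(np.allclose(out1, out2, atol=1e-4))  # True: BN((aW)u) matches BN(Wu)
```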
What are the shortcomings?
The paper does not seem to point out any shortcomings of its own.
Other notes
1. Benefits of mini-batches
- The loss over a mini-batch is a better estimate of the loss over the whole training set than the loss of a single example; the larger the batch size, the more accurate the estimate (a toy illustration follows this list)
- Training over a mini-batch is faster than the same number of single-example passes because the computation can be parallelized
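A toy illustration of the first point (the dataset, the quadratic loss, and the batch sizes here are made up for the example): the mini-batch loss is an estimator of the full-training-set loss whose spread shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)   # toy "training set"
w = 0.0                                              # toy parameter
loss = lambda batch: np.mean((batch - w) ** 2)       # loss estimated on a batch

print(f"full-training-set loss: {loss(data):.3f}")
for m in (1, 16, 256):
    estimates = [loss(rng.choice(data, size=m)) for _ in range(1000)]
    print(f"batch size {m:3d}: std of loss estimate = {np.std(estimates):.3f}")
# The standard deviation shrinks as m grows, i.e. the mini-batch loss
# becomes a better estimate of the loss over the whole training set.
```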
2. In theory, BN does not make the network worse
When $\gamma^{(k)}=\sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)}=\mathrm{E}[x^{(k)}]$, the BN transform is an identity transform, so (in theory) the network is at least no worse than without it. A quick numerical check follows.
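A minimal check of this claim (a sketch; it uses $\gamma=\sqrt{\mathrm{Var}[x]+\epsilon}$ rather than $\sqrt{\mathrm{Var}[x]}$ so that the small $\epsilon$ cancels exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))       # mini-batch of activations
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choosing gamma = sqrt(var + eps) and beta = mu inverts the normalization,
# so the BN transform can represent the identity.
gamma, beta = np.sqrt(var + eps), mu
y = gamma * x_hat + beta
print(np.allclose(y, x))  # True
```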
3. BN in convolutional networks
In convolutional networks, to stay consistent with the convolutional property, one pair of $\gamma$ and $\beta$ is learned per feature map rather than per activation as before; if the feature map is of size $p \times q$ and the mini-batch contains $m$ examples, the effective mini-batch size is $m'=mpq$ (see the sketch below).
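A minimal sketch of the convolutional case (assuming an NCHW layout, which is an implementation choice, not something the paper fixes): statistics are computed over the batch and both spatial dimensions, i.e. over $m'=mpq$ values per feature map, with one $\gamma,\beta$ pair per channel.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (m, C, p, q); gamma, beta: (C,). One (gamma, beta) pair per feature map."""
    # Mean/variance over the batch and spatial axes -> m' = m * p * q values per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.randn(8, 16, 28, 28)           # m = 8, C = 16, p = q = 28
y = batch_norm_conv(x, np.ones(16), np.zeros(16))
```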
4. Where to place the BN layer
The paper discusses, for a layer $z=g(Wu+b)$,
whether to apply BN to the input $u$ or to the pre-activation $Wu+b$.
Since $u$ is usually the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift; $Wu+b$, in contrast, is more likely to have a symmetric, non-sparse, "more Gaussian" distribution (a reference is cited in the paper at this point), so normalizing it is more likely to produce activations with a stable distribution. A sketch of the resulting ordering follows.
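A sketch of the resulting layer ordering (toy shapes; `relu` stands in for the nonlinearity $g$, and the bias $b$ is dropped because the subsequent mean subtraction cancels it and $\beta$ takes over its role, as noted in the paper):

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-5):
    mu, var = z.mean(axis=0), z.var(axis=0)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(2)
u = rng.normal(size=(32, 10))      # layer input
W = rng.normal(size=(10, 5))       # layer weights
gamma, beta = np.ones(5), np.zeros(5)

# Original layer:      z = g(Wu + b)
# Batch-normalized:    z = g(BN(Wu))   -- normalize Wu, not u
z = relu(bn(u @ W, gamma, beta))
```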
Future Work
- Apply BN to recurrent neural networks (RNNs)
- Investigate whether BN can help with domain adaptation: whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances
Questions
1. Why isn't the second term just written as 0?
In the paper,
$\frac{\partial\ell}{\partial\mu_\mathcal{B}}=\left(\sum_{i=1}^{m}\frac{\partial\ell}{\partial\hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}\right)+\frac{\partial\ell}{\partial\sigma_\mathcal{B}^2}\cdot\frac{\sum_{i=1}^{m}-2(x_i-\mu_\mathcal{B})}{m}$
but since $\mu_\mathcal{B}=\frac{1}{m}\sum_{i=1}^{m}x_i$, we have $\sum_{i=1}^{m}(x_i-\mu_\mathcal{B})=0$,
so the second term is 0, and an actual computation confirms this. Why, then, does the paper keep it in the formula? (A numerical check follows.)
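A numerical check of this observation (a toy sketch for a single feature; `dl_dxhat` plays the role of the upstream gradient $\partial\ell/\partial\hat{x}_i$). It only verifies that $\sum_i(x_i-\mu_\mathcal{B})=0$ makes the second term vanish, so both versions of $\partial\ell/\partial\mu_\mathcal{B}$ agree.

```python
import numpy as np

rng = np.random.default_rng(3)
m, eps = 64, 1e-5
x = rng.normal(size=m)                     # one feature over a mini-batch
dl_dxhat = rng.normal(size=m)              # random upstream gradient w.r.t. x_hat

mu, var = x.mean(), x.var()
dl_dvar = np.sum(dl_dxhat * (x - mu)) * (-0.5) * (var + eps) ** (-1.5)

# The paper's formula, with the second term included:
dl_dmu_full = np.sum(dl_dxhat * (-1.0 / np.sqrt(var + eps))) \
              + dl_dvar * np.sum(-2.0 * (x - mu)) / m
# The same gradient with the second term dropped:
dl_dmu_short = np.sum(dl_dxhat * (-1.0 / np.sqrt(var + eps)))

print(np.sum(x - mu))                          # ~0 (up to floating-point error)
print(np.isclose(dl_dmu_full, dl_dmu_short))   # True
```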
2. Some vocabulary
- inference: probably refers to the test phase of the network
- population: probably refers to the whole training set