Lecture 6 Training Neural Network (2)

This section is very important!

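Assume a data matrix X of shape [N x D], where N is the number of data points (for example, the number of images) and D is the dimensionality of each point.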

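Mean subtraction: the most common form of data preprocessing. Subtract from every data point the mean of each feature, computed across the data. For images, the mean is usually subtracted separately for each of the three RGB channels.
Normalization: rescale the data so that all dimensions lie on roughly the same scale. There are two common ways to do this. One is to divide each dimension by its standard deviation once the data has been zero-centered. The other is to scale each dimension so that it lies between -1 and +1. This preprocessing only makes sense if you have reason to believe that the features live on different scales but should matter equally to the learning algorithm. For images, all pixel values already lie in the range 0-255, so they share one scale and normalization is usually not necessary.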

One thing to watch out for: data statistics such as the mean must be computed on the training data only and then applied to the train, validation, and test sets. Do not compute the mean over the whole dataset!
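A minimal sketch of the preprocessing above, assuming NumPy and synthetic data. The helper names `fit_preprocessing` and `apply_preprocessing` and the small `eps` term are my own choices for illustration, not part of the lecture. The point it shows: the mean and standard deviation come from the training split only and are then reused for the other splits.

```python
import numpy as np

def fit_preprocessing(X_train, eps=1e-8):
    """Compute per-dimension statistics from the training data only."""
    mean = X_train.mean(axis=0)          # shape (D,)
    std = X_train.std(axis=0) + eps      # eps (assumed) avoids division by zero
    return mean, std

def apply_preprocessing(X, mean, std):
    """Zero-center, then divide each dimension by the training std."""
    return (X - mean) / std

# Synthetic [N x D] matrices standing in for real train/validation data.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_val = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

mean, std = fit_preprocessing(X_train)               # training statistics only
X_train_p = apply_preprocessing(X_train, mean, std)
X_val_p = apply_preprocessing(X_val, mean, std)      # same statistics reused
```

The validation data will generally not come out exactly zero-mean after this, and that is expected: it is transformed with the training statistics, just like test data would be.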

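Pitfall: initializing all weights to 0. If every weight is the same, every neuron computes the same output and receives the same gradient, so nothing breaks the symmetry between neurons.
Small random numbers: we still want the weights to be very close to zero, so we use W = 0.01 * np.random.randn(D, H), where randn samples from a Gaussian with zero mean and unit variance.
Calibrating the variances with 1/sqrt(n): one problem with the suggestion above is that the output distribution of a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that the variance of each neuron's output can be normalized to 1 by scaling its weight vector by the square root of its fan-in (i.e. its number of inputs). The recommended heuristic is therefore to initialize each neuron's weight vector as w = np.random.randn(n) / sqrt(n), where n is the number of its inputs. This ensures that all neurons in the network initially have approximately the same output distribution, and it empirically improves the rate of convergence.
One paper derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be 2.0/n. This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.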

In practice, the current recommendation is to use ReLU units with the initialization w = np.random.randn(n) * sqrt(2.0/n), and to initialize the biases to 0.
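The initialization rules above can be sketched for a whole layer as follows. This is only an illustration: the function name `init_layer` and its `relu` flag are hypothetical, and it uses NumPy's `default_rng` generator instead of `np.random.randn` purely for reproducibility.

```python
import numpy as np

def init_layer(n_in, n_out, rng, relu=True):
    """Initialize one fully connected layer with n_in inputs per neuron.

    relu=True  -> weights scaled by sqrt(2.0 / n_in) (ReLU recommendation)
    relu=False -> weights scaled by 1 / sqrt(n_in)   (variance calibration)
    """
    scale = np.sqrt(2.0 / n_in) if relu else np.sqrt(1.0 / n_in)
    W = rng.standard_normal((n_in, n_out)) * scale   # zero-mean Gaussian weights
    b = np.zeros(n_out)                              # biases start at 0
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(512, 256, rng)
print(W.var(), 2.0 / 512)   # empirical variance should be close to 2/n
```

Each column of W is the weight vector of one neuron, so n in the formulas corresponds to `n_in` here.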

Batch Normalization: In the implementation, applying this technique usually amounts to inserting a BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become very common practice to use Batch Normalization in neural networks. In practice, networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. Neat!
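To make the "preprocessing at every layer" reading concrete, here is a rough sketch of the training-time forward pass of a BatchNorm layer as I understand it. It is deliberately incomplete: there are no running averages for test time and no backward pass, and the names `batchnorm_forward`, `gamma`, `beta`, and `eps` are my own labels for the learnable scale, the learnable shift, and a small stabilizing constant.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm forward pass for an [N x D] activation matrix."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, roughly unit variance
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=10.0, size=(64, 5))   # badly scaled activations
out = batchnorm_forward(h, gamma=np.ones(5), beta=np.zeros(5))
```

With `gamma` set to ones and `beta` to zeros, the output is just the normalized activations, which is the same mean subtraction and normalization used on the input data, applied per batch.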

There are several ways to keep a network from overfitting.

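L2 regularization: the most common form of regularization. It directly penalizes the squared magnitude of the parameters W: for every weight w, the term $\frac{1}{2}\lambda w^2$ is added to the objective, where $\lambda$ is the regularization strength. During a gradient descent parameter update, using L2 regularization means that every weight is decayed linearly toward zero: W += -lambda * W.
L1 regularization:
Max norm constraints:
Dropout: a very effective and simple regularization method. During training, dropout keeps each neuron active with some probability p (a hyperparameter) and sets it to zero otherwise. (The notes include code.)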
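A small sketch of two of the regularizers listed above, L2 regularization and dropout, assuming NumPy. The function names are hypothetical. For dropout I use the "inverted" variant, which divides by p at training time so that nothing has to change at test time; that is my choice for the example and may differ from the code in the course notes.

```python
import numpy as np

def l2_penalty_and_grad(W, lam):
    """L2 term 0.5 * lam * sum(W^2) and its gradient lam * W."""
    penalty = 0.5 * lam * np.sum(W * W)
    grad = lam * W
    return penalty, grad

def dropout_forward(h, p, rng, train=True):
    """Inverted dropout: keep each unit with probability p during training."""
    if not train:
        return h                              # no dropout at test time
    mask = (rng.random(h.shape) < p) / p      # kept units are scaled by 1/p
    return h * mask

rng = np.random.default_rng(0)

# L2: with learning rate 1.0 and no data gradient, one step is W += -lam * W.
W = rng.standard_normal((4, 3))
penalty, dW = l2_penalty_and_grad(W, lam=0.1)
W_decayed = W - 1.0 * dW                      # every weight shrinks toward zero

# Dropout: about half of the activations are zeroed, the rest are doubled.
h = np.ones((1000, 50))
h_train = dropout_forward(h, p=0.5, rng=rng)
h_test = dropout_forward(h, p=0.5, rng=rng, train=False)
```

Because of the 1/p scaling, the expected value of each activation is the same with and without dropout, which is why the test-time branch can return the activations unchanged.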


For classification:

SVM: $L_i = \sum_{j \neq y_i} \max(0, f_j - f_{y_i} + 1)$
Softmax classifier (uses the cross-entropy loss):
$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$
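- The notes also introduce some more complex logistic-regression-style classification losses. I do not need them for my graduation project yet, so I have not read them closely; I will fill this in later.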
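The two classification losses above, written out for a single example as a sketch. Here `f` is the vector of class scores and `y` is the index of the correct class; the function names, the example scores, and the max-subtraction trick for numerical stability are my additions for illustration.

```python
import numpy as np

def svm_loss(f, y, delta=1.0):
    """Multiclass SVM (hinge) loss: sum over j != y of max(0, f_j - f_y + delta)."""
    margins = np.maximum(0.0, f - f[y] + delta)
    margins[y] = 0.0                      # the correct class is not summed
    return margins.sum()

def softmax_loss(f, y):
    """Cross-entropy loss: -log(exp(f_y) / sum_j exp(f_j))."""
    shifted = f - np.max(f)               # does not change the result, avoids overflow
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[y]

f = np.array([3.2, 5.1, -1.7])            # made-up scores for three classes
y = 0                                      # assume class 0 is the correct one
print(svm_loss(f, y), softmax_loss(f, y))
```

Both losses are zero or small when the correct class score is well above the others; they differ in that the SVM loss stops caring once the margin of 1 is met, while the softmax loss keeps pushing the correct class probability toward 1.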