Class2-Week3 Hyperparameter Tuning
Hyperparameter Tuning
Recommended Order:
- First: learning rate $\alpha$
- Second: momentum term $\beta$, #hidden units, mini-batch size
- Third: #layers, learning rate decay
Try random values when choosing hyperparameters, and adopt a coarse-to-fine strategy.
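As a rough illustration (the ranges and the hidden-unit example here are made up, not from the original notes), random search with a coarse-to-fine pass might look like this:

```python
import numpy as np

# Illustrative sketch: random search with a coarse-to-fine strategy.
# Sample hyperparameters at random over a wide range, evaluate each setting,
# then zoom in and re-sample around the region that worked best.

def sample_hidden_units(lo, hi):
    # pick the number of hidden units uniformly at random in [lo, hi]
    return np.random.randint(lo, hi + 1)

# Coarse pass: wide range
coarse_trials = [sample_hidden_units(50, 500) for _ in range(25)]
# ... train and evaluate each trial; suppose values around 200 did best ...

# Fine pass: denser sampling in the promising region
fine_trials = [sample_hidden_units(150, 250) for _ in range(25)]
```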
Using an appropriate scale to pick hyperparameters
If we’re trying to decide on the number of layers in the neural network, and the total number of layers should be somewhere between 2 and 4, then sampling uniformly at random among 2, 3, and 4 is reasonable. But if we’re trying to pick a proper value for a parameter such as the learning rate $\alpha$, which we think should be somewhere between 0.0001 and 1, sampling uniformly at random means about 90% of the sampled values would fall between 0.1 and 1, and only 10% of the resources would be used to search between 0.0001 and 0.1. Instead, we want to search for $\alpha$ between 0.1 and 1, between 0.01 and 0.1, between 0.001 and 0.01, and between 0.0001 and 0.001, each with the same probability. So we can sample on a log scale like this:
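A minimal NumPy sketch of this log-scale sampling (assuming we are tuning the learning rate $\alpha$, and, as an extra example, the momentum term $\beta$):

```python
import numpy as np

# Sample alpha on a log scale between 1e-4 and 1, so each decade
# (1e-4 to 1e-3, 1e-3 to 1e-2, ...) is chosen with equal probability.
r = np.random.uniform(-4, 0)   # r uniform in [-4, 0]
alpha = 10 ** r                # alpha log-uniform in [1e-4, 1]

# Same idea for beta in [0.9, 0.999]: sample 1 - beta on a log scale
# between 1e-3 and 1e-1.
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r
```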
Hyperparameters tuning in practice: Pandas vs. Caviar
- Pandas
One way is to babysit one model. This is usually the situation of a huge data set but not a lot of computational resources, not a lot of CPUs and GPUs, so you can basically afford to train only one model or a very small number of models at a time. In that case you might gradually babysit that model even as it’s training.
- Caviar
The other approach is to train many models in parallel. You pick some setting of the hyperparameters and just let the model run by itself, and you get some learning curve: this could be a plot of the cost function J on your training set, or your dev-set error, or some other metric you’re tracking. At the same time you might start up a different model with a different setting of the hyperparameters, and so on, training many different models in parallel.
The way to choose between these two approaches is really a function of how much computational resources you have. If you have enough computers to train a lot of models in parallel, then by all means take the caviar approach and try a lot of different hyperparameters and see what works.
Batch Normalization
Implementation
In the rise of deep learning, one of the most important ideas has been an algorithm called batch normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch normalization makes your hyperparameter search problem much easier and makes your neural network much more robust: a much bigger range of hyperparameters works well, and it also enables you to train even very deep networks much more easily.
When training a model such as logistic regression, you might remember that normalizing the input features can speed up learning. Batch Norm applies the same idea to a hidden layer’s pre-activations: we take the values $z^{(i)}$ and normalize them to have mean 0 and unit variance, so every component of z has mean 0 and variance 1. But we don’t want the hidden units to always have mean 0 and variance 1; maybe it makes sense for the hidden units to have a different distribution, so what we do is this:

$$\mu = \frac{1}{m} \sum_i z^{(i)}, \qquad \sigma^2 = \frac{1}{m} \sum_i \left(z^{(i)} - \mu\right)^2$$

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$

Here $\gamma$ and $\beta$ are learnable parameters that set the mean and variance of the linear variable $\tilde{z}^{(i)}$.
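A minimal NumPy sketch of this computation for one layer’s pre-activations over a mini-batch (the function name and shapes are my own assumptions):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z: pre-activations of one layer for a mini-batch, shape (n_units, m)
    mu = np.mean(z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = np.var(z, axis=1, keepdims=True)        # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)        # mean 0, variance 1
    z_tilde = gamma * z_norm + beta               # learnable scale and shift
    return z_tilde, mu, var
```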
Fitting Batch Norm into a Neural Network
$\gamma$ and $\beta$ can be learned using Adam, gradient descent with momentum, or RMSprop, not just plain gradient descent.
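As a rough sketch of where Batch Norm sits in the forward pass (reusing the hypothetical batch_norm_forward above; W, a_prev, gamma, and beta stand for this layer’s weights and parameters and the previous layer’s activations):

```python
import numpy as np

# One forward step for layer l with Batch Norm between the linear step and
# the activation. The bias b[l] is dropped because the mean subtraction would
# cancel it; beta plays that role instead.
z = W @ a_prev                                         # linear step, no bias term
z_tilde, mu, var = batch_norm_forward(z, gamma, beta)  # normalize, then scale and shift
a = np.maximum(0, z_tilde)                             # ReLU on the normalized values
```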
Why does Batch Norm work?
- Normalization:
Just like normalizing the input features, the X’s, to mean zero and variance one can speed up learning, Batch Norm does a similar thing, but for values further along in the hidden layers, so it can speed up the learning process.
- Learning more robust weights:
It limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the later layers learn on. Batch Norm therefore reduces the problem of the input values changing, the problem of covariate shift. It causes these values to become more stable, so that the later layers of the neural network have firmer ground to stand on. Even though the input distribution changes a bit, it changes less, and this weakens the coupling between what the early layers’ parameters have to do and what the later layers’ parameters have to do. So it allows each layer of the network to learn a little more independently of the other layers, which has the effect of speeding up learning in the whole network.
- Regularization:
With mini-batch gradient descent, the mean and variance are computed on just that mini-batch rather than on the entire data set, so they contain a little bit of noise; the scaling process, going from $z^{(i)}_{\text{norm}}$ to $\tilde{z}^{(i)}$, is therefore a little bit noisy as well. So, similar to dropout, Batch Norm adds some noise to each hidden layer’s activations and therefore has a slight regularization effect: it forces the downstream hidden units not to rely too much on any one hidden unit.
Batch Norm at Test Time
After training a neural network with Batch Norm, at test time, to evaluate the network on a single new example we still perform the same normalization, but with $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across the mini-batches seen during training.
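A minimal sketch of one way to do this (the momentum value 0.9 and the function names are assumptions, not from the original notes):

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    # exponentially weighted average of the mini-batch statistics,
    # updated once per mini-batch during training
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_test(z, gamma, beta, running_mu, running_var, eps=1e-8):
    # at test time, normalize a single example with the running estimates
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```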
Softmax Regression