2.1 Mini-batch Gradient Descent


Applying machine learning is a highly empirical, highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly.

It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of 5 million examples. In particular, here’s what you can do.

m training examples:

$X: (n_x, m), \quad Y: (1, m)$

Let’s say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. And let’s say each of your baby training sets has just 1000 examples. So you take $x^{(1)}$ through $x^{(1000)}$ (denoted $X^{\{1\}}$) and you call that your first little baby training set, also called the first mini-batch. Then you take the next 1000 examples, $x^{(1001)}$ through $x^{(2000)}$ (denoted $X^{\{2\}}$), then the next 1000 examples, and so on. Altogether you would have 5,000 of these mini-batches.

You would also split up your training data for Y accordingly.

$X^{\{t\}}: (n_x, 1000), \quad Y^{\{t\}}: (1, 1000)$
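
As a concrete illustration of this split, here is a minimal NumPy sketch that slices X and Y into consecutive mini-batches of 1000 examples. The function name `make_mini_batches` is illustrative, not from the lecture.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000):
    """Split X of shape (n_x, m) and Y of shape (1, m) into consecutive mini-batches.

    Returns a list of (X_t, Y_t) pairs, one per mini-batch, matching the
    X^{t}, Y^{t} notation above. (Function name is illustrative.)
    """
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, batch_size):
        X_t = X[:, start:start + batch_size]   # shape (n_x, batch_size)
        Y_t = Y[:, start:start + batch_size]   # shape (1, batch_size)
        mini_batches.append((X_t, Y_t))
    return mini_batches

# With m = 5,000,000 and batch_size = 1000 this yields 5,000 mini-batches.
```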

Notation:

  • $x^{(i)}$: the $i$-th training example
  • $z^{[l]}$: the $z$ value of the $l$-th layer (hidden units)
  • $x^{\{t\}}, y^{\{t\}}$: the $t$-th mini-batch

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples at the same time. I know it’s not a great name, but that’s just what it’s called. Mini-batch gradient descent, in contrast, refers to the algorithm we’ll talk about on the next slide, in which you process a single mini-batch $X^{\{t\}}, Y^{\{t\}}$ at a time rather than processing your entire training set $X, Y$ at the same time.


Pseudocode:

repeat {
  for t = 1, ..., 5000:

  • Forward prop on $X^{\{t\}}$ (vectorized implementation over the 1000 examples):
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    $A^{[1]} = g^{[1]}(Z^{[1]})$
    $\dots$
    $A^{[L]} = g^{[L]}(Z^{[L]})$
  • Compute the cost
    $J^{\{t\}} = \dfrac{1}{1000} \sum_{i} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \dfrac{\lambda}{2 \cdot 1000} \sum_{l} \big\lVert W^{[l]} \big\rVert_F^2$
  • Back prop to compute gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$)
  • Update: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $\quad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$
}
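
The pseudocode above is for a general L-layer network. As a rough, runnable illustration, here is a minimal NumPy sketch of one such pass, simplified to a single sigmoid layer (logistic regression). The function name `mini_batch_gd_epoch`, the single-layer choice, and the default hyperparameters are my own simplifications, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mini_batch_gd_epoch(X, Y, W, b, alpha=0.01, batch_size=1000, lam=0.0):
    """One epoch (one pass over X, Y) of mini-batch gradient descent.

    Simplified to a single-layer sigmoid model (logistic regression) instead of
    the general L-layer network in the pseudocode; names and defaults are illustrative.
    Shapes: X (n_x, m), Y (1, m), W (1, n_x), b (1, 1).
    """
    m = X.shape[1]
    for start in range(0, m, batch_size):           # t = 1, ..., m / batch_size
        X_t = X[:, start:start + batch_size]        # mini-batch X^{t}
        Y_t = Y[:, start:start + batch_size]        # mini-batch Y^{t}
        m_t = X_t.shape[1]

        # Forward prop on X^{t}
        Z = W @ X_t + b                             # Z = W X^{t} + b
        A = sigmoid(Z)                              # A = g(Z)

        # Gradients of the regularized cross-entropy cost J^{t} on this mini-batch
        dZ = A - Y_t
        dW = (dZ @ X_t.T) / m_t + (lam / m_t) * W
        db = np.sum(dZ, axis=1, keepdims=True) / m_t

        # Gradient descent update
        W = W - alpha * dW
        b = b - alpha * db
    return W, b
```

Calling this function repeatedly, once per pass over the data, plays the role of the outer repeat { ... } loop in the pseudocode.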

This is one pass through your training set using mini-batch gradient descent. The code I have written down here is also called doing one epoch of training, and "epoch" is a word that means a single pass through the training set. Whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Now of course you want to take multiple passes through the training set, so you might want another for loop or while loop out there, and you keep taking passes through the training set until hopefully you converge, or at least approximately converge. When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and that’s pretty much what everyone in deep learning uses when training on a large data set.
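
To make that comparison concrete with the numbers used in this example:

$$\frac{m}{\text{mini-batch size}} = \frac{5{,}000{,}000}{1{,}000} = 5{,}000 \text{ gradient descent steps per epoch, versus } 1 \text{ step per epoch with batch gradient descent.}$$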