Coursera | Andrew Ng (02-week-1-1.8): Other Regularization Methods
This series only adds personal study notes and supplementary derivations to the original course content; if there are any errors, corrections are welcome. Having studied Andrew Ng's course, I organized it into text to make review and look-up easier. Because I have been studying English, the series is mainly in English, and readers are also encouraged to treat the English as primary and the Chinese as an aid, as preparation for reading academic papers in related fields later on. - Z
Please credit the author and source when reposting: Z
知乎:https://zhuanlan.zhihu.com/c_147249273
****:http://blog.****.net/junjun_zhao/article/details/79072256
1.8 Other Regularization Methods
(Subtitles source: NetEase Cloud Classroom)
In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look. Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive, and sometimes you just can't get more data. What you can do is augment your training set by taking an image like this and, for example, flipping it horizontally and adding that to your training set. So now, instead of just the one example in your training set, you have both the original and the flipped image. By flipping the images horizontally, you can double the size of your training set. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand new independent examples, but you can do it without needing to pay the expense of going out to take more pictures of cats.
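As a small illustration (my own sketch, not code from the lecture), a horizontal-flip augmentation might look like this in numpy, assuming the training images are stored in an array `X_train` of shape `(m, height, width, 3)` with labels `Y_train` of shape `(m,)`:

```python
import numpy as np

def augment_with_horizontal_flips(X_train, Y_train):
    """Double the training set by adding a horizontally flipped copy of every image.

    X_train: array of shape (m, height, width, channels)
    Y_train: array of shape (m,) -- labels are unchanged by flipping
    """
    # Flip along the width axis (axis=2) to mirror each image left-right.
    X_flipped = X_train[:, :, ::-1, :]
    X_aug = np.concatenate([X_train, X_flipped], axis=0)
    Y_aug = np.concatenate([Y_train, Y_train], axis=0)
    return X_aug, Y_aug
```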
Other than flipping horizontally, you can also take random crops of the image. So here we've rotated and sort of randomly zoomed into the image, and it still looks like a cat. By taking random distortions and translations of the image, you can augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as a brand new, independent example of a cat would. But because you can do this almost for free, other than some computational cost, it can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat.
Notice I didn’t flip it vertically, because maybe we don’t want upside down cats, right? And then also maybe randomly zooming in to part of the image, it’s probably still a cat. For optical character recognition, you can also bring your data set by taking digits and imposing random rotations and distortions to it. So if you add these things to your training set, these are also still digit force. For illustration, I applied a very strong distortion. So this look very wavy for, in practice you don’t need to distort the four quite as aggressively, but just a more subtle distortion than what I’m showing here, to make this example clearer for you, right? But a more subtle distortion is usually used in practice, because this looks like really warped fours. So data augmentation can be used as a regularization technique, in fact similar to regularization.
(Translator's note: Hinton's capsule nets, published on 2017-10-26, can handle recognition problems such as scrambled facial features or upside-down images.)
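Random crops and mild rotations can be sketched in the same spirit. The snippet below is an illustrative example (not from the lecture), assuming a single image `img` of shape `(height, width, 3)` that is larger than the crop size; it uses `scipy.ndimage.rotate` for the small distortion and then takes a random crop:

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def random_crop_and_rotate(img, crop_h=200, crop_w=200, max_angle=15.0):
    """Return a randomly rotated and cropped copy of one image.

    img: array of shape (height, width, channels)
    The rotation angle is kept small so the object (e.g. the cat or the digit)
    is still clearly recognizable.
    """
    # Mild random rotation; reshape=False keeps the original image size.
    angle = rng.uniform(-max_angle, max_angle)
    distorted = rotate(img, angle, axes=(0, 1), reshape=False, mode='nearest')

    # Random crop of size (crop_h, crop_w).
    h, w = distorted.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return distorted[top:top + crop_h, left:left + crop_w]
```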
There’s one other technique that is often used called early stopping. So what you’re going to do is as you run gradient descent, you’re going to plot your, either the training error, you’ll use 01 classification error on the training set, or just plot the cost function
Well when you’ve haven’t run many iterations for your neural network yet, your parameters w will be close to zero. Because with random initialization, you probably initialize w to small random values,so before you train for a long time, w is still quite small. And as you iterate, as you train, w will get bigger and bigger and bigger until here maybe you have a much larger value of the parameters w for your neural network. So what early stopping does is by stopping halfway you have only a mid-size rate w. And so similar to L2 regularization by picking a neural network with smaller norm for your parameters w, hopefully your neural network is over fitting less. And the term early stopping refers to the fact that you’re just stopping the training of your neural network earlier. I sometimes use early stopping when training a neural network.
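A common way to implement this in code (a minimal sketch under my own assumptions, not code from the course) is to track the dev set error during training, remember the parameters that achieved the lowest dev error so far, and stop once the dev error has not improved for a while. Here `train_one_epoch`, `dev_error`, and `params` are hypothetical placeholders for your own training step, evaluation function, and parameter dictionary:

```python
import copy

def train_with_early_stopping(params, train_one_epoch, dev_error,
                              max_epochs=500, patience=10):
    """Stop training when the dev set error stops improving.

    params          -- dictionary of parameters (e.g. {'W1': ..., 'b1': ...})
    train_one_epoch -- hypothetical function: runs one pass of gradient descent,
                       updating `params` in place
    dev_error       -- hypothetical function: returns the current dev set error
    """
    best_error = float('inf')
    best_params = copy.deepcopy(params)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(params)
        current = dev_error(params)

        if current < best_error:
            # Dev error is still going down: remember this (mid-size w) snapshot.
            best_error = current
            best_params = copy.deepcopy(params)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Dev error has started to rise: stop early and keep the best w.
                break

    return best_params, best_error
```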
But it does have one downside; let me explain. I think of the machine learning process as comprising several different steps. One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent; we'll talk later about other algorithms, like Momentum, RMSprop, Adam, and so on. But after optimizing the cost function J, you also want to not overfit, and we have some tools to do that, such as regularization, getting more data, and so on. Now, in machine learning we already have so many hyperparameters to search over, and it's already very complicated to choose among the space of possible algorithms. So I find machine learning easier to think about when you have one set of tools for optimizing the cost function J, and treat not overfitting, in other words reducing variance, as a completely separate task.
And when you’re doing that, you have a separate set of tools for doing it. And this principle is sometimes called Orthogonalization. And this is an idea that you want to be able to think about one task at a time. I’ll say more about Orthogonalization in a later video, so if you don’t fully get the concept yet, don’t worry about it. But, to me the main downside of early stopping is that this couples, these two tasks. So you no longer can work on these two problems independently, because by stopping gradient descent early, you’re sort of breaking whatever you’re doing to optimize cost function
Rather than using early stopping, one alternative is to just use L2 regularization; then you can train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. But the downside is that you might have to try a lot of values of the regularization parameter lambda, which makes searching over many values of lambda more computationally expensive. The advantage of early stopping is that by running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn't completely make sense to you yet, don't worry about it; we're going to talk about orthogonalization in greater detail in a later video, and I think it will make a bit more sense then. Despite its disadvantages, many people do use early stopping.
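By contrast, the L2-regularization route typically looks like the sketch below (again an illustrative example with hypothetical `train` and `dev_error` functions, not code from the course): each candidate value of lambda requires a full training run, which is why it is more computationally expensive.

```python
def search_lambda(train, dev_error, lambdas=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Train one model per candidate lambda and keep the best on the dev set.

    train(lambd)      -- hypothetical function: trains to convergence with L2
                         regularization strength `lambd` and returns the parameters
    dev_error(params) -- hypothetical function: error of `params` on the dev set
    """
    best_lambda, best_params, best_error = None, None, float('inf')
    for lambd in lambdas:
        params = train(lambd)          # one full training run per lambda value
        err = dev_error(params)
        if err < best_error:
            best_lambda, best_params, best_error = lambd, params, err
    return best_lambda, best_params, best_error
```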
I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda. So you've now seen how to use data augmentation, as well as, if you wish, early stopping, in order to reduce variance or prevent overfitting in your neural network. Next, let's talk about some techniques for setting up your optimization problem to make your training go quickly.
Key takeaways:
Other regularization methods
- Data augmentation: apply transformations (flips, crops, distortions) to the images to obtain more training and validation data.
- Early stopping: stop iterating just before the error on the cross-validation (dev) set starts to rise, to avoid overfitting. The drawback of this method is that it cannot address bias and variance independently.