Coursera | Andrew Ng (02-week-1-1.4) - Regularization

This series only adds personal study notes and supplementary derivations to selected points of the original course. If there are any mistakes, corrections and feedback are welcome. After working through Andrew Ng's course, I organized it into text to make review and look-up easier. Since I have been studying English, the series is primarily in English, and I also suggest readers rely mainly on the English, with Chinese as a supplement, as preparation for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


When reposting, please credit the author and source: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu (知乎): https://zhuanlan.zhihu.com/c_147249273

****http://blog.****.net/junjun_zhao/article/details/79064170


1.4 Regularization

(Subtitle source: NetEase Cloud Classroom 网易云课堂)


If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization. The other way to address high variance is to get more training data; that's also quite reliable. But you can't always get more training data, or it could be expensive to get more data. Adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works. Let's develop these ideas using logistic regression.


Recall that for logistic regression, you try to minimize the cost function J, which is defined as this cost function: a sum over your training examples of the losses of the individual predictions on the different examples, where you recall that w and b are the parameters of logistic regression. So w is an n_x-dimensional parameter vector, and b is a real number. To add regularization to logistic regression, what you do is add to it this thing, lambda, which is called the regularization parameter. I'll say more about that in a second. But lambda/2m times the norm of w squared. Here, the norm of w squared is just equal to the sum from j = 1 to n_x of w_j squared, or this can also be written w transpose w; it's just the squared Euclidean norm of the parameter vector w. And this is called L2 regularization, because here you're using the Euclidean norm, also called the L2 norm, of the parameter vector w.

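To make the formula concrete, here is a minimal NumPy sketch of the L2-regularized logistic regression cost. The function and variable names (l2_regularized_cost, A, Y, lambd) are illustrative choices of mine, not from the course materials.

```python
import numpy as np

def l2_regularized_cost(A, Y, w, lambd):
    """Cross-entropy cost plus the L2 penalty (lambda / 2m) * ||w||_2^2.

    A     -- predicted probabilities, shape (1, m)
    Y     -- true labels (0/1), shape (1, m)
    w     -- weight vector, shape (n_x, 1)
    lambd -- regularization parameter (spelled lambd because `lambda` is reserved in Python)
    """
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # same value as (lambd / (2m)) * w^T w
    return cross_entropy + l2_penalty
```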

Now, why do you regularize just the parameter w? Why don't we add something here about b as well? In practice, you could do this, but I usually just omit it. Because if you look at your parameters, w is usually a pretty high dimensional parameter vector, especially with a high variance problem. Maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number. So almost all the parameters are in w rather than b. And if you add this last term, in practice it won't make much of a difference, because b is just one parameter out of a very large number of parameters. In practice, I usually just don't bother to include it. But you can if you want.


So L2 regularization is the most common type of regularization. You might have also heard some people talk about L1 regularization. That's when, instead of this L2 norm, you add a term that is lambda/m times the sum of the absolute values of the components of w. This is also called the L1 norm of the parameter vector w, hence the little subscript 1 down there, right? And I guess whether you put m or 2m in the denominator is just a scaling constant. If you use L1 regularization, then w will end up being sparse, which means that the w vector will have a lot of zeros in it. Some people say that this can help with compressing the model, because some of the parameters are zero, and you need less memory to store the model. Although I find that, in practice, L1 regularization, making your model sparse, helps only a little bit. So I don't think it's used that much, at least not for the purpose of compressing your model. And when people train networks, L2 regularization is just used much, much more often. Sorry, just fixing up some of the notation here.

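As a small illustration of the two penalty terms just described, here is a NumPy sketch. The helper names are mine, and the m-versus-2m scaling follows the transcript's wording.

```python
import numpy as np

def l1_penalty(w, lambd, m):
    # L1 term: (lambda / m) * sum_j |w_j| -- tends to push many weights to exactly zero (sparsity)
    return (lambd / m) * np.sum(np.abs(w))

def l2_penalty(w, lambd, m):
    # L2 term: (lambda / (2m)) * sum_j w_j^2 -- shrinks weights but rarely makes them exactly zero
    return (lambd / (2 * m)) * np.sum(np.square(w))
```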

So one last detail. Lambda here is called the regularization parameter. Usually, you set this using your development set, or using cross validation, where you try a variety of values and see what does best, in terms of trading off between doing well on your training set versus keeping the L2 norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune. And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language. So in the programming exercises, we'll use lambd, without the a, so as not to clash with the reserved keyword in Python. So we use lambd to represent the lambda regularization parameter. So this is how you implement L2 regularization for logistic regression.

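A brief sketch of the naming point and of a typical tuning loop over the dev set; `train_model` and `dev_set_error` are hypothetical placeholders, not functions from the course.

```python
# `lambda` is a reserved keyword in Python (it creates anonymous functions),
# so the programming exercises spell the regularization parameter `lambd`.
lambd = 0.7          # fine
# lambda = 0.7       # SyntaxError

# Hypothetical search over candidate values, keeping the one with the lowest dev-set error:
# candidates = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
# best = min(candidates, key=lambda l: dev_set_error(train_model(lambd=l)))
```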

How about a neural network? In a neural network, you have a cost function that's a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. And so the cost function is this: the sum of the losses, summed over your m training examples. To add regularization, you add lambda over 2m times the sum over all of your parameter matrices w[l] of what's called the squared norm, where this squared norm of a matrix is defined as the sum over i, sum over j, of each of the elements of that matrix, squared. And if you want the indices of this summation, this is the sum from i = 1 through n[l-1] and the sum from j = 1 through n[l], because w[l] is an n[l-1] by n[l] dimensional matrix, where these are the number of units in layer l-1 and the number of units in layer l. This matrix norm, it turns out, is called the Frobenius norm of the matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, this is not called the L2 norm of a matrix; instead, it's called the Frobenius norm of a matrix. I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention this is called the Frobenius norm. It just means the sum of squares of the elements of a matrix.

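The Frobenius penalty for a network can be sketched as below, assuming the weight matrices are stored in a dictionary under keys "W1" ... "WL" (a common layout in the course exercises, but an assumption here).

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m, L):
    """(lambda / 2m) times the sum of squared Frobenius norms of W[1]..W[L]."""
    total = 0.0
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]      # weight matrix of layer l
        total += np.sum(np.square(W))     # ||W[l]||_F^2: sum of the squares of all entries
    return (lambd / (2 * m)) * total
```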

So how do you implement gradient descent with this? Previously, we would compute dw[l] using backprop, where backprop would give us the partial derivative of J with respect to w[l], for any given l. And then you update w[l] as w[l] minus the learning rate times dw[l]. So this is before we added this extra regularization term to the objective. Now that we've added this regularization term to the objective, what you do is take dw[l] and add to it lambda/m times w[l]. And then you just compute this update, same as before. And it turns out that with this new definition, this new dw[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end. And it's for this reason that L2 regularization is sometimes also called weight decay. So if I take this definition of dw[l] and just plug it in here, then you see that the update is: w[l] gets w[l] minus the learning rate alpha times the thing you got from backprop plus lambda/m times w[l], throwing the minus sign in there. And so this is equal to w[l] minus alpha lambda/m times w[l], minus alpha times the thing you got from backprop. And so this term shows that whatever the matrix w[l] is, you're going to make it a little bit smaller, right? This is actually as if you're taking the matrix w and multiplying it by 1 minus alpha lambda/m. You're really taking the matrix w and subtracting alpha lambda/m times w from it, like you're multiplying the matrix w by this number, which is going to be a little bit less than 1. So this is why L2 norm regularization is also called weight decay.

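A minimal sketch of this update for a single layer, with illustrative argument names; `dW_backprop` stands for the gradient of the unregularized cost produced by backprop.

```python
def update_with_l2(W, dW_backprop, alpha, lambd, m):
    """One gradient-descent step on W with L2 regularization (illustrative names)."""
    dW = dW_backprop + (lambd / m) * W   # add the gradient of (lambda / 2m) * ||W||_F^2
    return W - alpha * dW
    # Equivalent "weight decay" form of the same step:
    #   (1 - alpha * lambd / m) * W - alpha * dW_backprop
```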

Because it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this thing, which is a little bit less than 1. So the alternative name for L2 regularization is weight decay. I'm not really going to use that name, but the intuition for why it's called weight decay is that this first term here is equal to this: you're just multiplying the weight matrix by a number slightly less than 1. So that's how you implement L2 regularization in a neural network. Now, one question that people sometimes ask me is, hey, Andrew, why does regularization prevent overfitting? Let's take a quick look at the next video, and gain some intuition for how regularization prevents overfitting.



Key points:

Regularization

Regularization is used to address the high variance problem: a regularization term is added to the cost function to penalize the complexity of the model.

Logistic regression

Cost function with the regularization term added:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$$

The expression above is L2 regularization for logistic regression.

  • L2 regularization: $\frac{\lambda}{2m}\|w\|_2^2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} w_j^2 = \frac{\lambda}{2m} w^T w$

  • L1 regularization: $\frac{\lambda}{2m}\|w\|_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} |w_j|$

where $\lambda$ is the regularization parameter.

Note: lambda is a reserved word in Python, so in code we use "lambd" to stand for the regularization parameter $\lambda$.

Neural network

Cost function with the regularization term added:

$$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|w^{[l]}\|_F^2$$

where $\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}} \left(w_{ij}^{[l]}\right)^2$, since $w^{[l]}$ has shape $(n^{[l-1]}, n^{[l]})$; this matrix norm is called the Frobenius norm.

Weight decay

With the regularization term added, the gradient becomes:

$$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$$

The gradient update rule is then:

$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$$

Substituting gives:

$$\begin{aligned} W^{[l]} &:= W^{[l]} - \alpha\left[(\text{from backprop}) + \frac{\lambda}{m}W^{[l]}\right] \\ &= W^{[l]} - \frac{\alpha\lambda}{m}W^{[l]} - \alpha\,(\text{from backprop}) \\ &= \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,(\text{from backprop}) \end{aligned}$$

Here $\left(1 - \frac{\alpha\lambda}{m}\right)$ is a factor less than 1, so it shrinks the original $W^{[l]}$ at every step, which is why L2-norm regularization is also called weight decay.
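As a quick numerical sanity check of the algebra above (with arbitrary illustrative values), the two forms of the update produce the same matrix:

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)            # a layer's weight matrix
dW_backprop = np.random.randn(3, 4)  # gradient of the unregularized cost
alpha, lambd, m = 0.1, 0.7, 50

direct = W - alpha * (dW_backprop + (lambd / m) * W)
decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop
print(np.allclose(direct, decay))    # True
```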



