Coursera | Andrew Ng (02-week-2-2.4) - Understanding Exponentially Weighted Averages
This series only adds personal study notes and supplementary derivations to the original course material; if there are any mistakes, corrections are welcome. Having worked through Andrew Ng's course, I organized it into text to make review and lookup easier. Since I have been studying English, this series is primarily in English, and I suggest readers also focus on the English, using the Chinese as support, to lay the groundwork for reading academic papers in related fields later on. - ZJ
Coursera course | deeplearning.ai | [NetEase Cloud Classroom](https://mooc.study.163.com/smartSpec/detail/1001319001.htm)
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/JUNJUN_ZHAO/article/details/79099001
2.4 Understanding Exponentially Weighted Averages
(Subtitle source: NetEase Cloud Classroom)
In the last video, we talked about exponentially weighted averages. This will turn out to be a key component of several optimization algorithms that you'll use to train your neural networks. So in this video, I want to delve a little bit deeper into intuitions for what this algorithm is really doing. Recall that this is the key equation for implementing exponentially weighted averages: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If beta equals 0.9, you get the red line. If it were much closer to one, say 0.98, you'd get the green line. And if it were much smaller, maybe 0.5, you'd get the yellow line.
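As a quick way to see how beta changes the smoothing, here is a minimal Python sketch; the temperature series and all the names in it are made up for illustration, not from the course:

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# Hypothetical noisy daily temperatures over 180 days.
rng = np.random.default_rng(0)
days = np.arange(180)
temps = 10 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, size=days.size)

for beta in (0.5, 0.9, 0.98):  # yellow, red, green lines in the lecture
    smoothed = ewa(temps, beta)
    print(f"beta={beta}: last smoothed value = {smoothed[-1]:.2f}")
```

Larger beta gives a smoother but more sluggish curve, which is exactly the red/green/yellow contrast described above.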
Let's look a bit more at that to understand how this is computing averages of the daily temperature. So here's that equation again, and let's set beta equals 0.9 and write out a few of the equations that this corresponds to. Whereas when you're implementing it you have t going from zero to one, to two, to three, with increasing values of t, to analyze it I've written it with decreasing values of t:

$$v_{100} = 0.9\,v_{99} + 0.1\,\theta_{100}$$
$$v_{99} = 0.9\,v_{98} + 0.1\,\theta_{99}$$
$$v_{98} = 0.9\,v_{97} + 0.1\,\theta_{98}$$

And this goes on. So let's take this first equation here and understand what $v_{100}$ really is. Substituting each equation into the one above it, you get

$$v_{100} = 0.1\,\theta_{100} + 0.1 \times 0.9\,\theta_{99} + 0.1 \times 0.9^{2}\,\theta_{98} + 0.1 \times 0.9^{3}\,\theta_{97} + \cdots$$
So this is a weighted sum of theta 100, theta 99, theta 98, theta 97, theta 96, and so on. One way to draw this in pictures: let's say we have some number of days of temperature, so the vertical axis is theta and the horizontal axis is t. Theta 100 will be some value, theta 99 will be some value, theta 98, and so on, for t equals 100, 99, 98, and so on. Right? It's some number of days of temperature. And then what we have is an exponentially decaying function, starting from 0.1, to 0.9 times 0.1, to 0.9 squared times 0.1, and so on. So you have this exponentially decaying function, and the way you compute $v_{100}$ is to take the element-wise product between these two functions, the temperatures and the decaying weights, and sum it up.
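To make the "element-wise product and sum" picture concrete, here is a small check, with hypothetical data, that the recursive update and the explicit weighted sum agree:

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(1)
thetas = rng.normal(20, 3, size=100)  # hypothetical temperatures theta_1 .. theta_100

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0.
v = 0.0
for theta in thetas:
    v = beta * v + (1 - beta) * theta

# Explicit form: v_100 = sum over k of (1 - beta) * beta^k * theta_{100-k}.
k = np.arange(len(thetas))
weights = (1 - beta) * beta ** k             # 0.1, 0.1*0.9, 0.1*0.9^2, ...
v_explicit = np.sum(weights * thetas[::-1])  # newest day gets the largest weight

print(v, v_explicit)  # identical up to floating-point error
```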
And finally, you might wonder: how many days' temperature is this averaging over? Well, it turns out that 0.9 to the power of 10 is about 0.35, and this turns out to be about one over e, where e is the base of the natural logarithm. More generally, if you have one minus epsilon, so in this example epsilon would be 0.1 and this is 0.9, then one minus epsilon to the power of one over epsilon is about one over e, about 0.34 to 0.35. So, in other words, it takes about 10 days for the height of this to decay to around a third, really one over e, of the peak. It's because of this that when beta equals 0.9, we say this is as if you're computing an exponentially weighted average that focuses on just the last 10 days' temperature, because after 10 days the weight decays to less than about a third of the weight of the current day.
Whereas, in contrast, if beta was equal to 0.98, then what power do you need to raise 0.98 to in order for it to be really small? It turns out that 0.98 to the power of 50 is approximately equal to one over e. So the weights stay reasonably big, bigger than one over e, for roughly the first 50 days, and then decay quite rapidly after that. So intuitively, as a rough rule of thumb, you can think of this as averaging over about 50 days' temperature, because in this example, to use the notation here on the left, it's as if epsilon is equal to 0.02, so one over epsilon is 50. And this, by the way, is how we got the formula that we're averaging over roughly one over one minus beta days; here, epsilon plays the role of one minus beta. It tells you, up to some constant, roughly how many days' temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it; it isn't a formal mathematical statement. Finally, let's talk about how you actually implement this. Recall that we start with $v_0$ initialized as zero, then compute $v_1$ on the first day, $v_2$ on the second, and so on.
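The one-over-e rule of thumb is easy to verify numerically; a minimal sketch:

```python
import math

# (1 - eps)^(1/eps) approaches 1/e as eps shrinks.
print(f"1/e = {1 / math.e:.4f}")
for eps in (0.1, 0.02):
    beta = 1 - eps
    window = round(1 / eps)  # rule-of-thumb averaging window in days
    print(f"beta={beta}: beta^{window} = {beta ** window:.4f} "
          f"(~1/e), so roughly a {window}-day average")
```

Running this prints 0.9^10 ≈ 0.349 and 0.98^50 ≈ 0.364, both close to 1/e ≈ 0.368, matching the 10-day and 50-day readings above.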
Now, to explain the algorithm, it was useful to write down $v_0$, $v_1$, $v_2$, and so on as distinct variables. But if you're implementing this in practice, you don't keep them all around: you take a single variable v, initialize it to zero, and then on each day overwrite it with $v := \beta v + (1-\beta)\theta_t$. A nice property of this is that it takes very little memory; you keep just one number and repeatedly overwrite it.
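In code, the in-place version described above is a one-line update inside a loop. A minimal Python sketch, where get_next_theta is a hypothetical stand-in for whatever produces your data:

```python
import random

beta = 0.9
v = 0.0  # v_0 = 0

def get_next_theta():
    """Hypothetical placeholder: yields the next observed value theta_t."""
    while True:
        yield 20 + random.gauss(0, 3)

stream = get_next_theta()
for _ in range(1000):
    theta = next(stream)
    v = beta * v + (1 - beta) * theta  # the single-line exponentially weighted update
print(v)
```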
But the disadvantage of explicitly keeping all the recent temperatures around and summing over the last 10 days is that it requires more memory, it's more complicated to implement, and it's computationally more expensive. In the next few videos we'll see examples where you need to compute averages of a lot of variables, and this is a very efficient way to do so, from both a computation and a memory-efficiency point of view, which is why it's used in a lot of machine learning. Not to mention that it's just one line of code, which is maybe another advantage. So now you know how to implement exponentially weighted averages. There's one more technical detail that's worth knowing about, called bias correction. Let's see that in the next video, and after that you'll use this to build a better optimization algorithm than straightforward gradient descent.
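For contrast, here is roughly what the explicit alternative would look like: keeping the last 10 days in a buffer and averaging them. This is a sketch under my own assumptions, not course code; note the buffer of window values versus the single float the exponentially weighted average needs:

```python
from collections import deque

window = 10
buffer = deque(maxlen=window)  # must store `window` values: more memory, more bookkeeping

beta, v = 0.9, 0.0             # the EWA needs only this one float

for theta in [21.5, 19.8, 22.1, 20.4, 18.9, 23.0, 21.2, 20.7, 19.5, 22.8, 21.0]:
    buffer.append(theta)
    sliding_avg = sum(buffer) / len(buffer)  # recomputed over the whole buffer
    v = beta * v + (1 - beta) * theta        # one multiply-add, one stored number

# The EWA reads low after so few steps because v_0 = 0;
# bias correction (next video) fixes exactly this.
print(f"sliding window: {sliding_avg:.2f}, EWA: {v:.2f}")
```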
Key takeaways:
Understanding exponentially weighted averages
Example, when β = 0.9:

$$v_{100} = 0.9\,v_{99} + 0.1\,\theta_{100}$$
$$v_{99} = 0.9\,v_{98} + 0.1\,\theta_{99}$$
$$v_{98} = 0.9\,v_{97} + 0.1\,\theta_{98}$$
$$\ldots$$

Expanding:

$$v_{100} = 0.1\,\theta_{100} + 0.1 \times 0.9\,\theta_{99} + 0.1 \times 0.9^{2}\,\theta_{98} + 0.1 \times 0.9^{3}\,\theta_{97} + \cdots$$
The coefficients in front of all the θ terms above sum to 1 or approximately 1; the exact sum is $1 - \beta^{t}$, which approaches 1 as t grows, and compensating for this gap in the early steps is what bias correction (next video) addresses.
In general, we have

$$(1-\varepsilon)^{1/\varepsilon} \approx \frac{1}{e}$$

so writing $\beta = 1 - \varepsilon$, the weight decays to about $1/e$ of its peak after roughly $\frac{1}{1-\beta}$ days, and $v_t$ can be viewed as averaging over roughly the last $\frac{1}{1-\beta}$ days: about 10 days for β = 0.9 and about 50 days for β = 0.98.
Implementing the exponentially weighted average
Because computing the average at the current time step requires only the previous average and the current value, exponentially weighted averaging is a very efficient method when the amount of data is very large: it saves computation and greatly reduces the storage and memory a program occupies.
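The procedure described above amounts to the following loop, shown here as a small Python sketch mirroring the update used throughout this note:

```python
def exponentially_weighted_average(stream, beta=0.9):
    """Yield the running EWA over an iterable of values theta_t."""
    v = 0.0
    for theta in stream:
        v = beta * v + (1 - beta) * theta  # only v and theta are ever in memory
        yield v

# Usage on a few hypothetical readings:
for v_t in exponentially_weighted_average([20.1, 21.3, 19.7], beta=0.9):
    print(round(v_t, 3))
```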
References:
[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (2-2) - 优化算法 (distilled notes on Andrew Ng's Coursera DeepLearning.ai course, part 2-2: optimization algorithms).
PS: You are welcome to scan the QR code and follow the WeChat official account "SelfImprovementLab"! It focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in and mutual-support activities around early rising, reading, exercise, English, and more.