Coursera | Andrew Ng (02-week-2-2.4) - Understanding Exponentially Weighted Averages
This series only adds personal study notes and supplementary derivations to the original course material; if there are any mistakes, corrections are welcome. Having worked through Andrew Ng's course, I organized it into text to make review and lookup easier. Since I have been studying English, this series is primarily in English, and I suggest readers also focus on the English, using the Chinese as support, to lay the groundwork for reading academic papers in related fields later on. - ZJ
Coursera course | deeplearning.ai | [NetEase Cloud Classroom](https://mooc.study.163.com/smartSpec/detail/1001319001.htm)
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/JUNJUN_ZHAO/article/details/79099001
2.4 Understanding Exponentially Weighted Averages
(Subtitle source: NetEase Cloud Classroom)
In the last video, we talked about exponentially weighted averages. This will turn out to be a key component of several optimization algorithms that you'll use to train your neural networks. So in this video, I want to delve a little bit deeper into intuitions for what this algorithm is really doing. Recall that this is the key equation for implementing exponentially weighted averages: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If beta equals 0.9, you get the red line. If it were much closer to one, say 0.98, you'd get the green line. And if it were much smaller, maybe 0.5, you'd get the yellow line.
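As a quick way to see how beta changes the smoothing, here is a minimal Python sketch; the temperature series and all the names in it are made up for illustration, not from the course:

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# Hypothetical noisy daily temperatures over 180 days.
rng = np.random.default_rng(0)
days = np.arange(180)
temps = 10 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, size=days.size)

for beta in (0.5, 0.9, 0.98):  # yellow, red, green lines in the lecture
    smoothed = ewa(temps, beta)
    print(f"beta={beta}: last smoothed value = {smoothed[-1]:.2f}")
```

Larger beta gives a smoother but more sluggish curve, which is exactly the red/green/yellow contrast described above.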
Let's look a bit more at that to understand how this is computing averages of the daily temperature. So here's that equation again, and let's set beta equals 0.9 and write out a few of the equations that this corresponds to. Whereas when you're implementing it you have t going from zero to one, to two, to three, with increasing values of t, to analyze it I've written it with decreasing values of t:

$$v_{100} = 0.9\,v_{99} + 0.1\,\theta_{100}$$
$$v_{99} = 0.9\,v_{98} + 0.1\,\theta_{99}$$
$$v_{98} = 0.9\,v_{97} + 0.1\,\theta_{98}$$

And this goes on. So let's take this first equation here and understand what $v_{100}$ really is. Substituting each equation into the one above it, you get

$$v_{100} = 0.1\,\theta_{100} + 0.1 \times 0.9\,\theta_{99} + 0.1 \times 0.9^{2}\,\theta_{98} + 0.1 \times 0.9^{3}\,\theta_{97} + \cdots$$
So this is a weighted sum of theta 100, theta 99, theta 98, theta 97, theta 96, and so on. One way to draw this in pictures: let's say we have some number of days of temperature, so the vertical axis is theta and the horizontal axis is t. Theta 100 will be some value, theta 99 will be some value, theta 98, and so on, for t equals 100, 99, 98, and so on. Right? It's some number of days of temperature. And then what we have is an exponentially decaying function, starting from 0.1, to 0.9 times 0.1, to 0.9 squared times 0.1, and so on. So you have this exponentially decaying function, and the way you compute $v_{100}$ is to take the element-wise product between these two functions, the temperatures and the decaying weights, and sum it up.
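To make the "element-wise product and sum" picture concrete, here is a small check, with hypothetical data, that the recursive update and the explicit weighted sum agree:

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(1)
thetas = rng.normal(20, 3, size=100)  # hypothetical temperatures theta_1 .. theta_100

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0.
v = 0.0
for theta in thetas:
    v = beta * v + (1 - beta) * theta

# Explicit form: v_100 = sum over k of (1 - beta) * beta^k * theta_{100-k}.
k = np.arange(len(thetas))
weights = (1 - beta) * beta ** k             # 0.1, 0.1*0.9, 0.1*0.9^2, ...
v_explicit = np.sum(weights * thetas[::-1])  # newest day gets the largest weight

print(v, v_explicit)  # identical up to floating-point error
```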
And finally, you might wonder: how many days' temperature is this averaging over? Well, it turns out that 0.9 to the power of 10 is about 0.35, and this turns out to be about one over e, where e is the base of the natural logarithm. More generally, if you have one minus epsilon, so in this example epsilon would be 0.1 and this is 0.9, then one minus epsilon to the power of one over epsilon is about one over e, about 0.34 to 0.35. So, in other words, it takes about 10 days for the height of this to decay to around a third, really one over e, of the peak. It's because of this that when beta equals 0.9, we say this is as if you're computing an exponentially weighted average that focuses on just the last 10 days' temperature, because after 10 days the weight decays to less than about a third of the weight of the current day.
Whereas, in contrast, if beta was equal to 0.98, then what power do you need to raise 0.98 to in order for it to be really small? It turns out that 0.98 to the power of 50 is approximately equal to one over e. So the weights stay reasonably big, bigger than one over e, for roughly the first 50 days, and then decay quite rapidly after that. So intuitively, as a rough rule of thumb, you can think of this as averaging over about 50 days' temperature, because in this example, to use the notation here on the left, it's as if epsilon is equal to 0.02, so one over epsilon is 50. And this, by the way, is how we got the formula that we're averaging over roughly one over one minus beta days; here, epsilon plays the role of one minus beta. It tells you, up to some constant, roughly how many days' temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it; it isn't a formal mathematical statement. Finally, let's talk about how you actually implement this. Recall that we start with $v_0$ initialized as zero, then compute $v_1$ on the first day, $v_2$ on the second, and so on.
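The one-over-e rule of thumb is easy to verify numerically; a minimal sketch:

```python
import math

# (1 - eps)^(1/eps) approaches 1/e as eps shrinks.
print(f"1/e = {1 / math.e:.4f}")
for eps in (0.1, 0.02):
    beta = 1 - eps
    window = round(1 / eps)  # rule-of-thumb averaging window in days
    print(f"beta={beta}: beta^{window} = {beta ** window:.4f} "
          f"(~1/e), so roughly a {window}-day average")
```

Running this prints 0.9^10 ≈ 0.349 and 0.98^50 ≈ 0.364, both close to 1/e ≈ 0.368, matching the 10-day and 50-day readings above.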
Now, to explain the algorithm, it was useful to write down $v_0$, $v_1$, $v_2$, and so on as distinct variables. But if you're implementing this in practice, you don't keep them all around: you take a single variable v, initialize it to zero, and then on each day overwrite it with $v := \beta v + (1-\beta)\theta_t$. A nice property of this is that it takes very little memory; you keep just one number and repeatedly overwrite it.
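In code, the in-place version described above is a one-line update inside a loop. A minimal Python sketch, where get_next_theta is a hypothetical stand-in for whatever produces your data:

```python
import random

beta = 0.9
v = 0.0  # v_0 = 0

def get_next_theta():
    """Hypothetical placeholder: yields the next observed value theta_t."""
    while True:
        yield 20 + random.gauss(0, 3)

stream = get_next_theta()
for _ in range(1000):
    theta = next(stream)
    v = beta * v + (1 - beta) * theta  # the single-line exponentially weighted update
print(v)
```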
But the disadvantage of explicitly keeping all the recent temperatures around and summing over the last 10 days is that it requires more memory, it's more complicated to implement, and it's computationally more expensive. In the next few videos we'll see examples where you need to compute averages of a lot of variables, and this is a very efficient way to do so, from both a computation and a memory-efficiency point of view, which is why it's used in a lot of machine learning. Not to mention that it's just one line of code, which is maybe another advantage. So now you know how to implement exponentially weighted averages. There's one more technical detail that's worth knowing about, called bias correction. Let's see that in the next video, and after that you'll use this to build a better optimization algorithm than straightforward gradient descent.
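For contrast, here is roughly what the explicit alternative would look like: keeping the last 10 days in a buffer and averaging them. This is a sketch under my own assumptions, not course code; note the buffer of window values versus the single float the exponentially weighted average needs:

```python
from collections import deque

window = 10
buffer = deque(maxlen=window)  # must store `window` values: more memory, more bookkeeping

beta, v = 0.9, 0.0             # the EWA needs only this one float

for theta in [21.5, 19.8, 22.1, 20.4, 18.9, 23.0, 21.2, 20.7, 19.5, 22.8, 21.0]:
    buffer.append(theta)
    sliding_avg = sum(buffer) / len(buffer)  # recomputed over the whole buffer
    v = beta * v + (1 - beta) * theta        # one multiply-add, one stored number

# The EWA reads low after so few steps because v_0 = 0;
# bias correction (next video) fixes exactly this.
print(f"sliding window: {sliding_avg:.2f}, EWA: {v:.2f}")
```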
Key takeaways:
Understanding exponentially weighted averages
Example, when β = 0.9:

$$v_{100} = 0.9\,v_{99} + 0.1\,\theta_{100}$$
$$v_{99} = 0.9\,v_{98} + 0.1\,\theta_{99}$$
$$v_{98} = 0.9\,v_{97} + 0.1\,\theta_{98}$$
$$\ldots$$

Expanding:

$$v_{100} = 0.1\,\theta_{100} + 0.1 \times 0.9\,\theta_{99} + 0.1 \times 0.9^{2}\,\theta_{98} + 0.1 \times 0.9^{3}\,\theta_{97} + \cdots$$
The coefficients in front of all the θ terms above sum to 1 or approximately 1; the exact sum is $1 - \beta^{t}$, which approaches 1 as t grows, and compensating for this gap in the early steps is what bias correction (next video) addresses.
In general, we have

$$(1-\varepsilon)^{1/\varepsilon} \approx \frac{1}{e}$$

so writing $\beta = 1 - \varepsilon$, the weight decays to about $1/e$ of its peak after roughly $\frac{1}{1-\beta}$ days, and $v_t$ can be viewed as averaging over roughly the last $\frac{1}{1-\beta}$ days: about 10 days for β = 0.9 and about 50 days for β = 0.98.
Implementing the exponentially weighted average
Because computing the average at the current time step requires only the previous average and the current value, exponentially weighted averaging is a very efficient method when the amount of data is very large: it saves computation and greatly reduces the storage and memory a program occupies.
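The procedure described above amounts to the following loop, shown here as a small Python sketch mirroring the update used throughout this note:

```python
def exponentially_weighted_average(stream, beta=0.9):
    """Yield the running EWA over an iterable of values theta_t."""
    v = 0.0
    for theta in stream:
        v = beta * v + (1 - beta) * theta  # only v and theta are ever in memory
        yield v

# Usage on a few hypothetical readings:
for v_t in exponentially_weighted_average([20.1, 21.3, 19.7], beta=0.9):
    print(round(v_t, 3))
```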
References:
[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (2-2) - 优化算法 (distilled notes on Andrew Ng's Coursera DeepLearning.ai course, part 2-2: optimization algorithms).
PS: You are welcome to scan the QR code and follow the WeChat official account "SelfImprovementLab"! It focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in and mutual-support activities around early rising, reading, exercise, English, and more.