Linear Regression with multiple variables - Gradient descent in practice I: Feature Scaling

Abstract: This article is the transcript of Lesson 30, "Gradient Descent in Practice I: Feature Scaling", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I took down the subtitles while studying the videos and lightly edited them to make them more concise and easier to read, for my own later reference, and I am sharing them here. If you find any mistakes, corrections are welcome and sincerely appreciated. I also hope this is helpful for your own studies.

In this video (article) and the one after it, I wanna tell you about some of the practical tricks for making gradient descent work well. In this video (article), I want to tell you about an idea called feature scaling. Here is the idea.

[Slide: Feature Scaling]

If you have a problem where you have multiple features, and you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values, then gradient descent can converge more quickly. Concretely, let's say you have a problem with two features where $x_1$ is the size of the house and takes on values between say $0$ and $2000$, and $x_2$ is the number of bedrooms, and maybe that takes on values between $1$ and $5$. If you plot the contours of the cost function $J(\theta)$, then the contours may look like this, where, let's see, $J(\theta)$ is a function of parameters $\theta_0$, $\theta_1$ and $\theta_2$. I'm going to ignore $\theta_0$, so let's forget about $\theta_0$ and pretend $J$ is a function of only $\theta_1$ and $\theta_2$. If $x_1$ can take on a much larger range of values than $x_2$, it turns out that the contours of the cost function $J(\theta)$ can take on this sort of very, very skewed elliptical shape, except that with the $2000$ to $5$ ratio, it can be even more skewed. So these very, very tall and skinny ellipses, or these very tall skinny ovals, can form the contours of the cost function $J(\theta)$. And if you run gradient descent on this sort of cost function, your gradient descent may end up taking a long time; it can oscillate back and forth and take a long time before it finally finds its way to the global minimum. In fact, you can imagine that if these contours are exaggerated even more, so you draw incredibly tall, skinny contours, and it can be even more extreme than that, then it turns out gradient descent will just have a much harder time making its way, meandering around; it can take a long time to find its way to the global minimum.
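
To make the "tall skinny ellipse" picture a bit more concrete, for linear regression the curvature of $J(\theta)$ is governed by $X^TX$, and the contour ellipses are stretched roughly in proportion to the square root of the ratio of its largest to smallest eigenvalue. The following is a minimal sketch (my own illustration, not from the lecture; the data are randomly generated) comparing that ratio before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
size = rng.uniform(0, 2000, m)        # x1: size of the house, 0 to 2000 square feet
bedrooms = rng.integers(1, 6, m)      # x2: number of bedrooms, 1 to 5

# Ignore theta_0, as in the lecture, and look only at the two feature columns.
X_raw = np.column_stack([size, bedrooms])
X_scaled = np.column_stack([size / 2000.0, bedrooms / 5.0])

def contour_stretch(X):
    """Ratio of the largest to smallest eigenvalue of X^T X; the contour
    ellipses of J(theta) are stretched roughly by the square root of this."""
    eigvals = np.linalg.eigvalsh(X.T @ X)
    return eigvals.max() / eigvals.min()

print("unscaled:", contour_stretch(X_raw))     # enormous -> very tall, skinny ellipses
print("scaled:  ", contour_stretch(X_scaled))  # orders of magnitude smaller -> much rounder contours
```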

In these settings, a useful thing to do is to scale the features. Concretely, if you instead define the feature $x_1$ to be the size of the house divided by $2000$, and define $x_2$ to be maybe the number of bedrooms divided by $5$, then the contours of the cost function $J(\theta)$ can become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then gradient descent, you can show mathematically, can find a much more direct path to the global minimum, rather than taking a much more convoluted path, trying to follow a much more complicated trajectory to get to the global minimum. So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, $x_1$ and $x_2$, between $0$ and $1$), you can wind up with an implementation of gradient descent that converges much faster.
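
As a rough, hands-on illustration of that speed-up, here is a minimal sketch (my own, not from the lecture) that runs batch gradient descent on synthetic housing data with and without dividing each feature by its maximum value; the data, learning rates, and stopping rule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
size = rng.uniform(100, 2000, m)
bedrooms = rng.integers(1, 6, m).astype(float)
y = 150.0 * size / 2000.0 + 10.0 * bedrooms + rng.normal(0, 5, m)  # synthetic prices

def gradient_descent(X, y, alpha, iters=100_000, tol=1e-9):
    """Run batch gradient descent; return the number of iterations taken until
    the cost J(theta) stops decreasing by more than tol (or the cap is hit)."""
    m = len(y)
    Xb = np.column_stack([np.ones(m), X])          # prepend x0 = 1
    theta = np.zeros(Xb.shape[1])
    prev_cost = np.inf
    for i in range(iters):
        error = Xb @ theta - y
        theta -= alpha / m * (Xb.T @ error)        # simultaneous update of all theta_j
        cost = (error @ error) / (2 * m)
        if prev_cost - cost < tol:
            return i
        prev_cost = cost
    return iters

X_raw = np.column_stack([size, bedrooms])
X_scaled = X_raw / X_raw.max(axis=0)               # divide each feature by its maximum

# The unscaled problem needs a tiny learning rate to avoid divergence,
# and still takes far more iterations than the scaled one.
print("unscaled:", gradient_descent(X_raw, y, alpha=1e-7))
print("scaled:  ", gradient_descent(X_scaled, y, alpha=0.1))
```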

[Slide: Feature Scaling: get every feature into approximately a $-1 \le x_i \le 1$ range]

More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a $-1 \le x_i \le 1$ range. And concretely, your feature $x_0$ is always equal to $1$. So that's already in that range, but you may end up dividing other features by different numbers to get them into this range. And the numbers $-1$ and $+1$ aren't too important. So, if you have a feature $x_1$ that winds up being between $0$ and $3$, say, that's not a problem. If you end up having a different feature that winds up being between $-2$ and $+0.5$, again, this is close enough to $-1$ and $+1$, so that's fine. It's only if you have a different feature, say $x_3$, that ranges from $-100$ to $+100$, that this is a very different range of values than $-1$ and $+1$. So this might be a less well-scaled feature. And similarly, if your features take on a very, very small range of values, so if $x_4$ takes on values between $-0.0001$ and $+0.0001$, then again this takes on a much smaller range of values than the $-1$ to $+1$ range, and again I would consider this feature poorly scaled. So you want the range of values to be, you know, it can be bigger than $+1$ or smaller than $+1$, but just not much bigger, like the $100$ here, or too much smaller, like the $0.0001$ over there. Different people have different rules of thumb, but the one that I use is that if a feature takes on a range of values from say $-3$ to $+3$, I usually think that should be just fine, but if it takes on much larger values than $+3$ or $-3$, I start to worry. And if it takes on values from say $-\frac{1}{3}$ to $+\frac{1}{3}$, you know, I think that's fine too, or $0$ to $\frac{1}{3}$ or $-\frac{1}{3}$ to $0$; I guess that's the typical range of values I consider okay. But if it takes on a much tinier range of values, like the $0.0001$ here, then again you start to worry. So, the take-home message is, don't worry if your features are not exactly on the same scale or exactly in the same range of values. As long as they're all close enough to this, gradient descent should work okay.
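
If you want to apply this rule of thumb programmatically, a tiny helper along the following lines will do; the thresholds of roughly $3$ and $\frac{1}{3}$ come from the rule above, and the helper itself is just a hypothetical sketch of my own:

```python
import numpy as np

def poorly_scaled_features(X, upper=3.0, lower=1.0 / 3.0):
    """Return column indices of X whose values stretch well beyond roughly -3..3,
    or whose whole range is much narrower than about 1/3."""
    X = np.asarray(X, dtype=float)
    flagged = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.abs(col).max() > upper or col.max() - col.min() < lower:
            flagged.append(j)
    return flagged

X = np.column_stack([
    np.linspace(-2, 2, 50),            # fine: roughly the -1..1 scale
    np.linspace(-100, 100, 50),        # much too large
    np.linspace(-0.0001, 0.0001, 50),  # much too small
])
print(poorly_scaled_features(X))       # -> [1, 2]
```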

[Slide: Mean normalization]

In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. And what I mean by that is that you take a feature $x_i$ and replace it with $x_i - \mu_i$ to make your features have approximately zero mean. Obviously we won't apply this to the feature $x_0$, because the feature $x_0$ is always equal to $1$, so it cannot have an average value of $0$. But, concretely, for the other features, if the size of the house takes on values between $0$ and $2000$ and, you know, the average size of a house is $1000$, then you might use this formula: set the feature $x_1 = \frac{\text{size} - 1000}{2000}$. And similarly, if your houses have one to five bedrooms and on average a house has two bedrooms, then you might use this formula to mean-normalize your second feature: $x_2 = \frac{\text{bedrooms} - 2}{5}$. In both of these cases, you therefore end up with features $x_1$ and $x_2$ that take on values roughly between $-0.5$ and $+0.5$. That's not exactly true - $x_2$ can actually be slightly larger than $0.5$ - but, you know, close enough.
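
In code, those two concrete formulas are simply (using the lecture's numbers of an average size of $1000$ with range $2000$, and an average of $2$ bedrooms with range $5$; the house below is a made-up example):

```python
size, bedrooms = 1416.0, 3.0      # a made-up example house

x1 = (size - 1000.0) / 2000.0     # (size - average size) / range      -> 0.208
x2 = (bedrooms - 2.0) / 5.0       # (bedrooms - average bedrooms) / range -> 0.2
print(x1, x2)                     # both roughly within [-0.5, 0.5]
```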

And the more general rule is that you might take a feature $x_i$ and replace it with $\frac{x_i - \mu_i}{s_i}$, where, to define these terms:

- $\mu_i$ is the average value of $x_i$ in the training set,

- and $s_i$ is the range of values of that feature, and by range, I mean the maximum value minus the minimum value. Or, for those of you that understand the standard deviation of a variable, setting $s_i$ to be the standard deviation of the variable would be fine, too. But taking the max minus min is fine as well.

And similarly, for the second feature $x_2$, you can replace $x_2$ with $\frac{x_2 - \mu_2}{s_2}$. And this sort of formula will get your features, you know, maybe not exactly, but roughly into these sorts of ranges. By the way, for those of you that are being super careful, technically if we're taking the range as max minus min, the $5$ here would actually become a $4$. So if the max is $5$ and the min is $1$, then the range is actually equal to $4$. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. The feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.
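
Putting the general rule into code, here is a minimal sketch (my own, not from the lecture) of a mean-normalization helper that uses either the range (max minus min) or the standard deviation as $s_i$:

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Replace every column x_i of X with (x_i - mu_i) / s_i, where mu_i is the
    column mean and s_i is the range (max - min) or the standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s      # return mu and s so new examples get the same transform

# Example training set: size in square feet, number of bedrooms.
X = np.array([[2104, 5],
              [1416, 3],
              [1534, 3],
              [ 852, 2]], dtype=float)
X_norm, mu, s = mean_normalize(X)
print(X_norm)   # every column now lies roughly within [-0.5, 0.5]
```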

So, now you know about feature scaling and if you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. In the next video, I will tell you about another trick to make gradient descent work well in practice.

<end>