Linear Regression with multiple variables - Gradient descent in practice I: Feature Scaling
Abstract: This article is the transcript of Lesson 30, "Gradient Descent in Practice I: Feature Scaling", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it is easier to review later. I am sharing it here in the hope that it helps others with their study; if there are any mistakes, corrections are welcome and sincerely appreciated.
In this video(article) and the video (article) after this one, I wanna tell you about some of the practical tricks for making gradient descent work well. In this video (article), I want to tell you about an idea called feature scaling. Here is the idea.
If you have a problem where you have multiple features, if you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values, then gradient descent can converge more quickly. Concretely, let's say you have a problem with two features, where $x_1$ is the size of the house and takes on values between, say, $0$ and $2000$, and $x_2$ is the number of bedrooms, and maybe that takes on values between $1$ and $5$. If you plot the contours of the cost function $J(\theta)$, then the contours may look like this, where, let's see, $J(\theta)$ is a function of parameters $\theta_0$, $\theta_1$ and $\theta_2$. I'm going to ignore $\theta_0$, so let's forget about $\theta_0$ and pretend $J$ is a function of only $\theta_1$ and $\theta_2$. If $x_1$ can take on a much larger range of values than $x_2$, it turns out that the contours of the cost function $J(\theta)$ can take on this sort of very, very skewed elliptical shape, except that with the $2000$ to $5$ ratio, it can be even more skewed. So these very, very tall and skinny ellipses, or very tall skinny ovals, can form the contours of the cost function $J(\theta)$. And if you run gradient descent on this sort of cost function, your gradients may end up taking a long time, and can oscillate back and forth and take a long time before they finally find their way to the global minimum. In fact, you can imagine that if these contours were exaggerated even more, drawn as incredibly tall, skinny contours, even more extreme than these, then gradient descent would have a much harder time finding its way; meandering around, it can take a long time to reach the global minimum.
In these settings, a useful thing to do is to scale the features. Concretely, if you instead define the feature $x_1$ to be the size of the house divided by $2000$, and define $x_2$ to be maybe the number of bedrooms divided by $5$, then the contours of the cost function $J(\theta)$ can become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then gradient descent, you can show mathematically, can find a much more direct path to the global minimum, rather than taking a much more convoluted path, trying to follow a much more complicated trajectory to get to the global minimum. So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, $x_1$ and $x_2$, between $0$ and $1$), you can wind up with an implementation of gradient descent that converges much faster.
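As a concrete illustration of that speed-up, here is a minimal sketch (not part of the original lecture) that runs batch gradient descent on a small synthetic house-price data set twice: once on the raw features, and once after dividing the size by $2000$ and the number of bedrooms by $5$. The data, coefficients, learning rates, and iteration counts are all assumptions chosen only for the demonstration.

```python
# Sketch: gradient descent with and without feature scaling.
# The synthetic data and all hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m = 100
size = rng.uniform(0, 2000, m)               # x1: house size, roughly 0..2000
bedrooms = rng.integers(1, 6, m).astype(float)  # x2: number of bedrooms, 1..5
y = 0.025 * size + 10.0 * bedrooms + rng.normal(0, 2, m)  # made-up prices

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent on J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    cost = (1 / (2 * m)) * np.sum((X @ theta - y) ** 2)
    return theta, cost

# Unscaled features: x1 is hundreds of times larger than x2,
# so the learning rate must be tiny to keep the updates stable.
X_raw = np.column_stack([np.ones(m), size, bedrooms])
_, cost_raw = gradient_descent(X_raw, y, alpha=1e-7, num_iters=200)

# Scaled features: divide each feature by its maximum value (2000 and 5).
X_scaled = np.column_stack([np.ones(m), size / 2000, bedrooms / 5])
_, cost_scaled = gradient_descent(X_scaled, y, alpha=0.5, num_iters=200)

print(f"cost after 200 iterations, unscaled: {cost_raw:.3f}")
print(f"cost after 200 iterations, scaled:   {cost_scaled:.3f}")
```

With the raw features, the learning rate has to be tiny to stay stable along the size direction, so the other parameters barely move in 200 iterations and the cost stays high; with the scaled features, a much larger learning rate works and the cost drops close to its minimum in the same number of iterations.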
More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a $-1$ to $+1$ range. Concretely, your feature $x_0$ is always equal to $1$, so that's already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers $-1$ and $+1$ aren't too important. So, if you have a feature $x_1$ that winds up being between $0$ and $3$, say, that's not a problem. If you end up having a different feature that winds up being between $-2$ and $+0.5$, again, this is close enough to $-1$ and $+1$, so that's fine. It's only if you have a different feature, say $x_3$, that ranges from $-100$ to $+100$, that these are very different values from $-1$ and $+1$, so this might be a less well-scaled feature. And similarly, if your features take on a very, very small range of values, so if $x_4$ takes on values between $-0.0001$ and $+0.0001$, then again this takes on a much smaller range of values than the $-1$ to $+1$ range, and again I would consider this feature poorly scaled. So the range of values can be somewhat bigger or smaller than the $-1$ to $+1$ range, just not much bigger, like the $\pm 100$ here, or much smaller, like the $\pm 0.0001$ one over there. Different people have different rules of thumb, but the one that I use is that if a feature takes on a range of values from, say, $-3$ to $+3$, I usually think that should be just fine, but if it takes on much larger values than $+3$ or $-3$, I start to worry. And if it takes on values from, say, $-\frac{1}{3}$ to $+\frac{1}{3}$, I think that's fine too, or $0$ to $\frac{1}{3}$, or $-\frac{1}{3}$ to $0$; I guess those are typical ranges of values I consider okay. But if it takes on a much tinier range of values, like $x_4$ here, then again you start to worry. So the take-home message is: don't worry if your features are not exactly on the same scale or exactly in the same range of values. As long as they're all close enough to this range, gradient descent should work okay.
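If you want to automate this sanity check, the small helper below (my own illustration, not from the course) flags columns whose values look far outside the rough $-3$ to $+3$ and $-\frac{1}{3}$ to $+\frac{1}{3}$ rules of thumb; the factor-of-ten slack in the thresholds is an assumption.

```python
# Sketch: a rough check of feature ranges against the rules of thumb above.
import numpy as np

def check_feature_ranges(X, upper=3.0, lower=1/3):
    """Warn if a column's values go far beyond +/-`upper`, or if the whole
    column spans much less than `lower`. The factor-of-ten slack is arbitrary."""
    for j in range(X.shape[1]):
        col = X[:, j]
        max_abs = np.abs(col).max()
        spread = col.max() - col.min()
        if max_abs > 10 * upper:
            verdict = "much larger than the -3..+3 rule of thumb"
        elif spread < lower / 10:
            verdict = "much smaller than the -1/3..+1/3 rule of thumb"
        else:
            verdict = "close enough"
        print(f"feature {j}: min={col.min():.4g}, max={col.max():.4g} -> {verdict}")

# x1 in 0..3 is fine; a -100..100 feature is too big; a +/-0.0001 feature is too small.
X = np.column_stack([
    np.linspace(0, 3, 50),
    np.linspace(-100, 100, 50),
    np.linspace(-1e-4, 1e-4, 50),
])
check_feature_ranges(X)
```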
In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. What I mean by that is that you take a feature $x_i$ and replace it with $x_i - \mu_i$, to make your features have approximately zero mean. Obviously we won't apply this to the feature $x_0$, because the feature $x_0$ is always equal to $1$, so it cannot have an average value of $0$. But concretely, for the other features, if the sizes of the houses take on values between $0$ and $2000$, and if, you know, the average size of a house is equal to $1000$, then you might use this formula: set the feature $x_1 = \frac{\text{size} - 1000}{2000}$. And similarly, if your houses have one to five bedrooms, and if on average a house has two bedrooms, then you might use this formula to mean-normalize your second feature: $x_2 = \frac{\#\text{bedrooms} - 2}{5}$. In both of these cases, you therefore end up with features $x_1$ and $x_2$ that take on values roughly between $-0.5$ and $+0.5$. That's not exactly true ($x_2$ can actually be slightly larger than $0.5$), but it's close enough.
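A minimal sketch of these two concrete formulas, assuming a handful of made-up house sizes and bedroom counts:

```python
# Sketch: mean normalization with the lecture's house example
# (average size 1000, divisor 2000; average 2 bedrooms, divisor 5).
import numpy as np

size = np.array([1600.0, 950.0, 2000.0, 700.0, 1250.0])   # made-up sizes
bedrooms = np.array([3.0, 2.0, 4.0, 1.0, 2.0])             # made-up counts

x1 = (size - 1000.0) / 2000.0      # x1 = (size - 1000) / 2000
x2 = (bedrooms - 2.0) / 5.0        # x2 = (#bedrooms - 2) / 5

print(x1)  # values roughly between -0.5 and +0.5
print(x2)  # values roughly between -0.5 and +0.5
```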
And the more general rule is that you might take a feature $x_1$ and replace it with $\frac{x_1 - \mu_1}{s_1}$, where, to define these terms:
- $\mu_1$ is the average value of $x_1$ in the training set,
- and $s_1$ is the range of values of that feature, and by range, I mean the maximum value minus the minimum value. Or, for those of you that know what the standard deviation of the variable is, setting $s_1$ to be the standard deviation of the variable would be fine too, but taking this max minus min would be fine as well.
And similarly for the second feature $x_2$, you can replace $x_2$ with $\frac{x_2 - \mu_2}{s_2}$. This sort of formula will get your features, maybe not exactly, but roughly, into these sorts of ranges. By the way, for those of you that are being super careful, technically if we're taking the range as max minus min, the $5$ here will actually become a $4$: if the max is $5$ and the min is $1$, then the range is actually equal to $4$. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. And the feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.
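Putting the general rule into code, the sketch below (the function name and interface are my own, not from the course) normalizes each column by its mean and either its max-minus-min range or its standard deviation, and returns $\mu$ and $s$ so the same transformation can be reused:

```python
# Sketch: general mean normalization, x_i := (x_i - mu_i) / s_i,
# where s_i is the range (max - min) or, optionally, the standard deviation.
import numpy as np

def mean_normalize(X, use_std=False):
    """Return (X_norm, mu, s): each column of X shifted by its mean mu and
    divided by s, where s is the column's max-min range or, if use_std is
    True, its standard deviation."""
    mu = X.mean(axis=0)
    if use_std:
        s = X.std(axis=0)
    else:
        s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# Example with the bedroom feature: max 5, min 1, so its divisor is 4 here.
X = np.array([[2100.0, 5.0],
              [1400.0, 3.0],
              [ 850.0, 2.0],
              [1100.0, 1.0]])
X_norm, mu, s = mean_normalize(X)
print(mu)      # column means: [1362.5, 2.75]
print(s)       # ranges (max - min): [1250., 4.]
print(X_norm)  # each column now has zero mean and a spread of 1
```

Returning $\mu$ and $s$ matters in practice: whatever shift and scale you compute on the training set is what you should apply to new examples at prediction time.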
So, now you know about feature scaling, and if you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. In the next video (article), I will tell you about another trick to make gradient descent work well in practice.
<end>