Linear Regression with multiple variables - Gradient descent in practice I: Feature Scaling
Abstract: This article is the transcript of Lesson 30, "Gradient Descent in Practice I: Feature Scaling", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it is easier to review later. I am sharing it here in the hope that it helps others with their study; if there are any mistakes, corrections are welcome and sincerely appreciated.
In this video(article) and the video (article) after this one, I wanna tell you about some of the practical tricks for making gradient descent work well. In this video (article), I want to tell you about an idea called feature scaling. Here is the idea.
If you have a problem where you have multiple features, if you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values, then gradient descent can converge more quickly. Concretely, let's say you have a problem with two features, where $x_1$ is the size of the house and takes on values between, say, $0$ and $2000$, and $x_2$ is the number of bedrooms, and maybe that takes on values between $1$ and $5$. If you plot the contours of the cost function $J(\theta)$, then the contours may look like this, where, let's see, $J(\theta)$ is a function of parameters $\theta_0$, $\theta_1$ and $\theta_2$. I'm going to ignore $\theta_0$, so let's forget about $\theta_0$ and pretend $J$ is a function of only $\theta_1$ and $\theta_2$. If $x_1$ can take on a much larger range of values than $x_2$, it turns out that the contours of the cost function $J(\theta)$ can take on this sort of very, very skewed elliptical shape, except that with the $2000$ to $5$ ratio, it can be even more skewed. So these very, very tall and skinny ellipses, or very tall skinny ovals, can form the contours of the cost function $J(\theta)$. And if you run gradient descent on this sort of cost function, your gradients may end up taking a long time, and can oscillate back and forth and take a long time before they finally find their way to the global minimum. In fact, you can imagine that if these contours were exaggerated even more, drawn as incredibly tall, skinny contours, even more extreme than these, then gradient descent would have a much harder time finding its way; meandering around, it can take a long time to reach the global minimum.
In these settings, a useful thing to do is to scale the features. Concretely, if you instead define the feature $x_1$ to be the size of the house divided by $2000$, and define $x_2$ to be maybe the number of bedrooms divided by $5$, then the contours of the cost function $J(\theta)$ can become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then gradient descent, you can show mathematically, can find a much more direct path to the global minimum, rather than taking a much more convoluted path, trying to follow a much more complicated trajectory to get to the global minimum. So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, $x_1$ and $x_2$, between $0$ and $1$), you can wind up with an implementation of gradient descent that converges much faster.
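As a concrete illustration of that speed-up, here is a minimal sketch (not part of the original lecture) that runs batch gradient descent on a small synthetic house-price data set twice: once on the raw features, and once after dividing the size by $2000$ and the number of bedrooms by $5$. The data, coefficients, learning rates, and iteration counts are all assumptions chosen only for the demonstration.

```python
# Sketch: gradient descent with and without feature scaling.
# The synthetic data and all hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m = 100
size = rng.uniform(0, 2000, m)               # x1: house size, roughly 0..2000
bedrooms = rng.integers(1, 6, m).astype(float)  # x2: number of bedrooms, 1..5
y = 0.025 * size + 10.0 * bedrooms + rng.normal(0, 2, m)  # made-up prices

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent on J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    cost = (1 / (2 * m)) * np.sum((X @ theta - y) ** 2)
    return theta, cost

# Unscaled features: x1 is hundreds of times larger than x2,
# so the learning rate must be tiny to keep the updates stable.
X_raw = np.column_stack([np.ones(m), size, bedrooms])
_, cost_raw = gradient_descent(X_raw, y, alpha=1e-7, num_iters=200)

# Scaled features: divide each feature by its maximum value (2000 and 5).
X_scaled = np.column_stack([np.ones(m), size / 2000, bedrooms / 5])
_, cost_scaled = gradient_descent(X_scaled, y, alpha=0.5, num_iters=200)

print(f"cost after 200 iterations, unscaled: {cost_raw:.3f}")
print(f"cost after 200 iterations, scaled:   {cost_scaled:.3f}")
```

With the raw features, the learning rate has to be tiny to stay stable along the size direction, so the other parameters barely move in 200 iterations and the cost stays high; with the scaled features, a much larger learning rate works and the cost drops close to its minimum in the same number of iterations.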
More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a $-1$ to $+1$ range. Concretely, your feature $x_0$ is always equal to $1$, so that's already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers $-1$ and $+1$ aren't too important. So, if you have a feature $x_1$ that winds up being between $0$ and $3$, say, that's not a problem. If you end up having a different feature that winds up being between $-2$ and $+0.5$, again, this is close enough to $-1$ and $+1$, so that's fine. It's only if you have a different feature, say $x_3$, that ranges from $-100$ to $+100$, that these are very different values from $-1$ and $+1$, so this might be a less well-scaled feature. And similarly, if your features take on a very, very small range of values, so if $x_4$ takes on values between $-0.0001$ and $+0.0001$, then again this takes on a much smaller range of values than the $-1$ to $+1$ range, and again I would consider this feature poorly scaled. So the range of values can be somewhat bigger or smaller than the $-1$ to $+1$ range, just not much bigger, like the $\pm 100$ here, or much smaller, like the $\pm 0.0001$ one over there. Different people have different rules of thumb, but the one that I use is that if a feature takes on a range of values from, say, $-3$ to $+3$, I usually think that should be just fine, but if it takes on much larger values than $+3$ or $-3$, I start to worry. And if it takes on values from, say, $-\frac{1}{3}$ to $+\frac{1}{3}$, I think that's fine too, or $0$ to $\frac{1}{3}$, or $-\frac{1}{3}$ to $0$; I guess those are typical ranges of values I consider okay. But if it takes on a much tinier range of values, like $x_4$ here, then again you start to worry. So the take-home message is: don't worry if your features are not exactly on the same scale or exactly in the same range of values. As long as they're all close enough to this range, gradient descent should work okay.
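If you want to automate this sanity check, the small helper below (my own illustration, not from the course) flags columns whose values look far outside the rough $-3$ to $+3$ and $-\frac{1}{3}$ to $+\frac{1}{3}$ rules of thumb; the factor-of-ten slack in the thresholds is an assumption.

```python
# Sketch: a rough check of feature ranges against the rules of thumb above.
import numpy as np

def check_feature_ranges(X, upper=3.0, lower=1/3):
    """Warn if a column's values go far beyond +/-`upper`, or if the whole
    column spans much less than `lower`. The factor-of-ten slack is arbitrary."""
    for j in range(X.shape[1]):
        col = X[:, j]
        max_abs = np.abs(col).max()
        spread = col.max() - col.min()
        if max_abs > 10 * upper:
            verdict = "much larger than the -3..+3 rule of thumb"
        elif spread < lower / 10:
            verdict = "much smaller than the -1/3..+1/3 rule of thumb"
        else:
            verdict = "close enough"
        print(f"feature {j}: min={col.min():.4g}, max={col.max():.4g} -> {verdict}")

# x1 in 0..3 is fine; a -100..100 feature is too big; a +/-0.0001 feature is too small.
X = np.column_stack([
    np.linspace(0, 3, 50),
    np.linspace(-100, 100, 50),
    np.linspace(-1e-4, 1e-4, 50),
])
check_feature_ranges(X)
```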
In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. What I mean by that is that you take a feature $x_i$ and replace it with $x_i - \mu_i$, to make your features have approximately zero mean. Obviously we won't apply this to the feature $x_0$, because the feature $x_0$ is always equal to $1$, so it cannot have an average value of $0$. But concretely, for the other features, if the sizes of the houses take on values between $0$ and $2000$, and if, you know, the average size of a house is equal to $1000$, then you might use this formula: set the feature $x_1 = \frac{\text{size} - 1000}{2000}$. And similarly, if your houses have one to five bedrooms, and if on average a house has two bedrooms, then you might use this formula to mean-normalize your second feature: $x_2 = \frac{\#\text{bedrooms} - 2}{5}$. In both of these cases, you therefore end up with features $x_1$ and $x_2$ that take on values roughly between $-0.5$ and $+0.5$. That's not exactly true ($x_2$ can actually be slightly larger than $0.5$), but it's close enough.
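A minimal sketch of these two concrete formulas, assuming a handful of made-up house sizes and bedroom counts:

```python
# Sketch: mean normalization with the lecture's house example
# (average size 1000, divisor 2000; average 2 bedrooms, divisor 5).
import numpy as np

size = np.array([1600.0, 950.0, 2000.0, 700.0, 1250.0])   # made-up sizes
bedrooms = np.array([3.0, 2.0, 4.0, 1.0, 2.0])             # made-up counts

x1 = (size - 1000.0) / 2000.0      # x1 = (size - 1000) / 2000
x2 = (bedrooms - 2.0) / 5.0        # x2 = (#bedrooms - 2) / 5

print(x1)  # values roughly between -0.5 and +0.5
print(x2)  # values roughly between -0.5 and +0.5
```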
And the more general rule is that you might take a feature $x_1$ and replace it with $\frac{x_1 - \mu_1}{s_1}$, where, to define these terms:
- $\mu_1$ is the average value of $x_1$ in the training set,
- and $s_1$ is the range of values of that feature, and by range, I mean the maximum value minus the minimum value. Or, for those of you that know what the standard deviation of the variable is, setting $s_1$ to be the standard deviation of the variable would be fine too, but taking this max minus min would be fine as well.
And similarly for the second feature $x_2$, you can replace $x_2$ with $\frac{x_2 - \mu_2}{s_2}$. This sort of formula will get your features, maybe not exactly, but roughly, into these sorts of ranges. By the way, for those of you that are being super careful, technically if we're taking the range as max minus min, the $5$ here will actually become a $4$: if the max is $5$ and the min is $1$, then the range is actually equal to $4$. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. And the feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.
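Putting the general rule into code, the sketch below (the function name and interface are my own, not from the course) normalizes each column by its mean and either its max-minus-min range or its standard deviation, and returns $\mu$ and $s$ so the same transformation can be reused:

```python
# Sketch: general mean normalization, x_i := (x_i - mu_i) / s_i,
# where s_i is the range (max - min) or, optionally, the standard deviation.
import numpy as np

def mean_normalize(X, use_std=False):
    """Return (X_norm, mu, s): each column of X shifted by its mean mu and
    divided by s, where s is the column's max-min range or, if use_std is
    True, its standard deviation."""
    mu = X.mean(axis=0)
    if use_std:
        s = X.std(axis=0)
    else:
        s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# Example with the bedroom feature: max 5, min 1, so its divisor is 4 here.
X = np.array([[2100.0, 5.0],
              [1400.0, 3.0],
              [ 850.0, 2.0],
              [1100.0, 1.0]])
X_norm, mu, s = mean_normalize(X)
print(mu)      # column means: [1362.5, 2.75]
print(s)       # ranges (max - min): [1250., 4.]
print(X_norm)  # each column now has zero mean and a spread of 1
```

Returning $\mu$ and $s$ matters in practice: whatever shift and scale you compute on the training set is what you should apply to new examples at prediction time.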
So, now you know about feature scaling, and if you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. In the next video (article), I will tell you about another trick to make gradient descent work well in practice.
<end>