Machine Learning Series: Coursera Week 6 - Advice for Applying Machine Learning
Contents
1. Evaluating a learning Algorithm
1.1 Deciding what to try next
1.2 Evaluating a hypothesis
1.3 Model selection and training/validation/test
2. Bias VS Variance
2.1 Diagnosing bias VS variance
2.2 Regularization and Bias/Variance
2.3 Learning curves
2.4 Deciding what to do next
3. Building a spam classifier
3.1 Prioritizing what to work on: Spam classification example
3.2 Error analysis
4. Handling skewed data
4.1 Error metrics for skewed classes
4.2 Trading off precision and recall
5. Using large data sets
1. Evaluating a learning Algorithm
1.1 Deciding what to try next
Debugging a learning algorithm:
Suppose you have implemented regularized linear regression to predict housing prices.
However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features
- Try decreasing λ
- Try increasing λ
Machine learning diagnostic:
Diagnostic: a test that you can run to gain insight into what is/isn't working with a learning algorithm, and to gain guidance on how best to improve its performance.
Diagnostics can take time to implement, but doing so can be a very good use of your time.
1.2 Evaluating a hypothesis
Evaluating your hypothesis:
Figure 1
(from Coursera Week 6, Evaluating a hypothesis)
Figure 2
(from Coursera Week 6, Evaluating a hypothesis)
Training/testing procedure for linear regression:
- Learn parameters θ from the training data (minimizing training error J(θ))
- Compute test set error J_test(θ)
Training/testing procedure for logistic regression:
- Learn parameters θ from the training data
- Compute test set error J_test(θ)
Misclassification error (0/1 error): err(h(x), y) = 1 if the prediction is wrong (h(x) >= 0.5 while y = 0, or h(x) < 0.5 while y = 1), and 0 otherwise; the test error is the average of err(h(x), y) over the test set.
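A minimal NumPy sketch of both computations, assuming the data has already been split into training and test portions (the course suggests roughly 70%/30%) and θ has been learned on the training portion; the helper names are illustrative, not the course's code:

```python
import numpy as np

def linreg_test_error(theta, X_test, y_test):
    """Squared-error test cost J_test(theta) for linear regression."""
    m = len(y_test)
    residuals = X_test @ theta - y_test
    return np.sum(residuals ** 2) / (2 * m)

def logreg_misclassification_error(theta, X_test, y_test):
    """Fraction of test examples the logistic regression hypothesis gets wrong."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis h(x)
    predictions = (h >= 0.5).astype(int)          # predict 1 when h(x) >= 0.5
    return float(np.mean(predictions != y_test))  # average 0/1 error
```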
1.3 Model selection and training/validation/test
Model selection:
Figure 3
(from Coursera Week 6, Model selection and training/validation/test)
E.g. fit models of degree d = 1, 2, ..., 10 and suppose the degree with the lowest test error turns out to be d = 5.
How well does the model generalize? Report the test set error J_test(θ_5).
Problem: J_test(θ_5) is likely to be an optimistic estimate of the generalization error, because the extra parameter d was itself fit to the test set. (That is, if the best d is chosen using the test set, the resulting test error is no longer an honest estimate of the generalization error; choosing d on the training set does not work either. This is why a separate cross validation set is introduced.)
Figure 4
(from Coursera Week 6, Model selection and training/validation/test)
Figure 5
(from Coursera Week 6, Model selection and training/validation/test)
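The figures describe splitting the data into a training set, a cross validation set and a test set (roughly 60%/20%/20% in the course), fitting every candidate model on the training set, picking the degree d with the lowest cross validation error, and only then reporting the test error once. A sketch of that procedure on toy data (the polynomial-feature and least-squares helpers are illustrative assumptions):

```python
import numpy as np

def poly_features(x, d):
    """Map a 1-D input to [1, x, x^2, ..., x^d]."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_linreg(X, y):
    """Plain least-squares fit."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def squared_error(theta, X, y):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

# toy data for illustration; split 60% train / 20% CV / 20% test
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = x ** 2 + rng.normal(scale=0.5, size=100)
idx = rng.permutation(len(x))
train, cv, test = idx[:60], idx[60:80], idx[80:]

best = (np.inf, None, None)                      # (J_cv, d, theta)
for d in range(1, 11):                           # candidate degrees d = 1..10
    theta = fit_linreg(poly_features(x[train], d), y[train])
    cv_err = squared_error(theta, poly_features(x[cv], d), y[cv])
    if cv_err < best[0]:
        best = (cv_err, d, theta)

cv_err, d, theta = best
test_err = squared_error(theta, poly_features(x[test], d), y[test])  # honest estimate
print(f"chosen d = {d}, J_cv = {cv_err:.3f}, J_test = {test_err:.3f}")
```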
2. Bias VS Variance
2.1 Diagnosing bias VS variance
Figure 6
(from Coursera Week 6, Diagnosing bias VS variance)
Figure 7
(from Coursera Week 6, Diagnosing bias VS variance)
Diagnosing bias VS variance:
Suppose your learning algorithm is performing less well than you were hoping (J_test(θ) or J_CV(θ) is high). Is it a bias problem or a variance problem?
Figure 8
(from Coursera Week 6, Diagnosing bias VS variance)
Bias (underfit): training error is high, and the CV error is approximately equal to the training error (both high).
Variance (overfit): training error is low, and the CV error is much higher than the training error.
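A tiny illustrative helper that turns this rule of thumb into code; the numeric thresholds are arbitrary assumptions, only there to make the comparison concrete:

```python
def diagnose(train_err, cv_err, target_err):
    """Rough bias/variance diagnosis from training and cross validation error."""
    if train_err > target_err and (cv_err - train_err) < 0.5 * train_err:
        return "high bias (underfit): both errors high and close together"
    if train_err <= target_err and cv_err > 2 * train_err:
        return "high variance (overfit): CV error much larger than training error"
    return "no clear single diagnosis (or a mix of both)"

print(diagnose(train_err=0.50, cv_err=0.55, target_err=0.10))  # -> high bias
print(diagnose(train_err=0.05, cv_err=0.60, target_err=0.10))  # -> high variance
```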
2.2 Regularization and Bias/Variance
Linear regression with regularization
E.g.
model:
Figure 9
(from Coursera Week 6, Regularization and Bias/Variance)
Choosing the regularization parameter λ
E.g:
Figure 10
(from Coursera Week 6, Regularization and Bias/Variance)
Choose the λ with the lowest cross validation error J_CV(θ).
Plotting bias/variance as a function of the regularization parameter λ:
Figure 11
(from Coursera Week 6, Regularization and Bias/Variance)
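A sketch of the λ-selection loop, assuming regularized linear regression solved via the normal equation; the course tries λ values such as 0, 0.01, 0.02, 0.04, ... (roughly doubling) up to about 10, and note that the J_CV used for the comparison is computed without the regularization term even though the fit itself is regularized:

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Regularized linear regression via the normal equation (theta_0 not penalized)."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def unregularized_cost(theta, X, y):
    """J(theta) without the lambda term, used for J_train and J_CV."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

# toy data for illustration; the first column is the bias term
rng = np.random.default_rng(1)
true_theta = np.array([1.0, 2.0, -1.0, 0.5])
X_train = np.c_[np.ones(40), rng.normal(size=(40, 3))]
y_train = X_train @ true_theta + rng.normal(scale=0.3, size=40)
X_cv = np.c_[np.ones(20), rng.normal(size=(20, 3))]
y_cv = X_cv @ true_theta + rng.normal(scale=0.3, size=20)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
results = [(unregularized_cost(fit_ridge(X_train, y_train, lam), X_cv, y_cv), lam)
           for lam in lambdas]
best_cv_err, best_lambda = min(results)          # pick the lambda with the lowest J_CV
print(f"chosen lambda = {best_lambda}, J_cv = {best_cv_err:.4f}")
```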
2.3 Learning curves
Figure 12
(from Coursera Week 6, Learning curves)
Here m is deliberately varied from small values up to the full training set size, i.e. the curves are obtained by training on subsets that are usually smaller than the whole training set.
High bias:
Figure 13
(from Coursera Week 6, Learning curves)
Both J_train(θ) and J_CV(θ) flatten out at a high value and end up close to each other; if a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
High variance:
Figure 14
(from Coursera Week 6, Learning curves)
J_train(θ) stays low while J_CV(θ) stays much higher, leaving a large gap between them; if a learning algorithm is suffering from high variance, getting more training data is likely to help.
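A sketch of how the learning curves are produced: for increasing training set sizes m, train only on the first m examples, then record the training error on those m examples and the CV error on the full cross validation set (toy data and helpers are illustrative):

```python
import numpy as np

def fit_linreg(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

# toy data for illustration; in practice use your real training / CV split
rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.4, size=100)
X_train, y_train, X_cv, y_cv = X[:70], y[:70], X[70:], y[70:]

sizes, train_errs, cv_errs = [], [], []
for m in range(3, len(y_train) + 1):              # grow the training subset
    theta = fit_linreg(X_train[:m], y_train[:m])
    sizes.append(m)
    train_errs.append(cost(theta, X_train[:m], y_train[:m]))  # error on the m examples used
    cv_errs.append(cost(theta, X_cv, y_cv))                   # error on the full CV set

# plotting train_errs and cv_errs against sizes gives the two learning curves
```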
2.4 Deciding what to do next
Debugging a learning algorithm:
Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
- Get more training examples -----------> fixes high variance
- Try smaller sets of features -----------> fixes high variance
- Try getting additional features -----------> fixes high bias
- Try adding polynomial features -----------> fixes high bias
- Try decreasing λ -----------> fixes high bias
- Try increasing λ -----------> fixes high variance
Figure 15
(from Coursera Week 6, Deciding what to do next)
Larger neural networks tend to perform better; if a large network overfits, use regularization to address it.
3. Building a spam classifier
3.1 Prioritizing what to work on: Spam classification example
Building a spam classifier:
Supervised learning: x = features of the email, y = spam (1) or not spam (0).
Features x: choose 100 words indicative of spam/not spam.
E.g.: deal, buy, discount, andrew, now, ...
Figure 16
(from Coursera Week 6, Prioritizing what to work on: Spam classification example)
Note: in practice, take the n most frequently occurring words (10,000 to 50,000) in the training set, rather than manually picking 100 words.
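A minimal sketch of turning an email into such a 0/1 feature vector; the tokenizer and vocabulary-building here are simplified assumptions (the course itself works from a pre-built vocabulary list):

```python
import re
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(training_emails, n=10000):
    """Take the n most frequently occurring words in the training set."""
    counts = Counter(word for email in training_emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(n)]

def email_to_features(email, vocabulary):
    """x_j = 1 if word j of the vocabulary appears in the email, else 0."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]

# tiny usage example
vocab = build_vocabulary(["buy now big discount", "meeting notes attached"], n=100)
print(email_to_features("Huge DISCOUNT, buy today!", vocab))
```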
How should you spend your time to make the classifier have low error?
- Collect lots of data
- Develop sophisticated features based on email routing information (from the email header)
- Develop sophisticated features for the message body, e.g. should "discount" and "discounted" be treated as the same word?
- Develop sophisticated algorithms to detect misspellings
3.2 Error analysis
m_CV = 500 examples in cross validation set.
Algorithm misclassifies 100 emails.
Manually examine the 100 errors, and categorize them based on:
(1) what type of email it is
(2) what features you think would have helped the algorithm classify them correctly:
- Deliberate misspellings
- Unusual email routing (origin)
- Unusual punctuation
The importance of numerical evaluation:
Should discount/discounts/discounted/discounting be treated as the same word?
Can use "stemming" software (e.g. the Porter stemmer).
Error analysis may not be helpful for deciding whether this is likely to improve performance; the only way to know is to try it and see whether it works.
You need a numerical evaluation (e.g. cross validation error) of the algorithm's performance with and without stemming.
Carry out the error analysis on the CV set.
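A hedged sketch of such a with/without-stemming comparison, assuming the emails and labels are already split into training and CV sets; it uses NLTK's PorterStemmer and a scikit-learn bag-of-words classifier purely as stand-ins for whatever pipeline you actually have:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

stemmer = PorterStemmer()

def cv_error(train_texts, y_train, cv_texts, y_cv, stem):
    """Misclassification error on the CV set, with or without stemming."""
    if stem:
        reduce = lambda t: " ".join(stemmer.stem(w) for w in t.lower().split())
        train_texts = [reduce(t) for t in train_texts]
        cv_texts = [reduce(t) for t in cv_texts]
    vectorizer = CountVectorizer(binary=True)
    X_train = vectorizer.fit_transform(train_texts)
    X_cv = vectorizer.transform(cv_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return 1.0 - clf.score(X_cv, y_cv)

# assuming train_texts, y_train, cv_texts, y_cv are already loaded:
# print("without stemming:", cv_error(train_texts, y_train, cv_texts, y_cv, stem=False))
# print("with stemming:   ", cv_error(train_texts, y_train, cv_texts, y_cv, stem=True))
```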
4. Handling skewed data
4.1 Error metrics for skewed classes
Cancer classification example.
Train logistic regression model h(x). (y=1 if cancer, y=0 otherwise)
Find that you got 1% error on test set.
But only 0.05% of patients have cancer. When the ratio of positive to negative examples is this lopsided, the classes are called skewed classes.
On skewed classes, plain classification accuracy is usually not a good measure of performance, so we need a different error metric.
Precision/Recall:
y = 1 in the presence of the rare class that we want to detect.
Figure 17
(from Coursera Week 6, Error metrics for skewed classes)
Precision = true positives / # predicted positive = TP / (TP + FP)
Recall = true positives / # actual positives = TP / (TP + FN)
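A minimal sketch of computing precision and recall from 0/1 prediction and label arrays:

```python
import numpy as np

def precision_recall(y_pred, y_true):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN), for the rare class y = 1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
print(precision_recall(y_pred, y_true))   # (0.666..., 0.666...) on this toy example
```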
4.2 Trading off precision and recall
Logistic regression: 0 <= h(x) <= 1
predict 1 if h(x) >= 0.5
predict 0 if h(x) < 0.5
(1) suppose we want to predict y = 1 (cancer) only if very confident
------> h(x) >= 0.7, y = 1
h(x) < 0.7, y = 0
------> higher precision, lower recall
(2) suppose we want to avoid missing too many cases of cancer (avoid false negatives)
------> h(x) >= 0.3, y = 1
h(x) < 0.3, y = 0
------> lower precision, higher recall
Vary the threshold to plot the precision-recall curve:
Figure 18
(from Coursera Week 6, Trading off precision and recall)
F1 score (F score):
How do we compare precision/recall numbers? Use a single-number metric: F1 = 2 * P * R / (P + R).
Figure 19
(from Coursera Week 6, Trading off precision and recall)
Note: compute precision, recall and F1 on the CV set (e.g. when picking the threshold).
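A sketch of using the F1 score on the CV set to pick a threshold; h_cv and y_cv below are toy values standing in for the hypothesis outputs and true labels on your cross validation set:

```python
import numpy as np

def precision_recall(y_pred, y_true):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return p, r

def f1_score(p, r):
    """F1 = 2PR / (P + R); defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# toy CV-set values for illustration
h_cv = np.array([0.9, 0.8, 0.65, 0.4, 0.35, 0.2, 0.1, 0.05])
y_cv = np.array([1, 1, 0, 1, 0, 0, 0, 0])

best_f1, best_t = 0.0, None
for t in np.arange(0.05, 1.0, 0.05):              # candidate thresholds
    p, r = precision_recall((h_cv >= t).astype(int), y_cv)
    f1 = f1_score(p, r)
    if f1 > best_f1:
        best_f1, best_t = f1, t
print(f"best F1 = {best_f1:.3f} at threshold = {best_t:.2f}")
```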
5. Using large data sets
A large training set is likely to help when:
(1) The features x contain sufficient information to predict y accurately.
Useful test: given the input x, can a human expert confidently predict y?
(2) You use a learning algorithm with many parameters (low bias), so the training error can be made small.
(3) You use a very large training set (unlikely to overfit), so the training error and test error end up close to each other.