Machine Learning Notes ---- Introduction + Linear Regression
Introduction
1. Definition of Machine Learning
A computer program is said to learn from experience E, with respect to some task T, and some performance measure P, if its performance on T as measured by P improves with experience E.
2. Supervised & Unsupervised Learning
Supervised Learning: the correct answers (labels) are given
1) Regression: output is continuous
2) Classification: output is discrete
Unsupervised Learning: the answers are not known; only the data is available
1) Clustering
2) Non-clustering
Example: the "Cocktail Party Algorithm" (separating overlapping audio sources)
Model, Cost Function and Gradient Descent
Some definitions:
m: size of the training set (number of training examples)
n: number of input features
x: input
y: output
$$(x^{(i)},y^{(i)})$$: the i-th training example; $$x_{j}^{(i)}$$ denotes the j-th feature of the i-th input
hypothesis: the function h(x) that maps inputs to predicted outputs
1. Linear Regression
$$h(x)=\theta_{0}+\theta_{1}x$$
θ: parameters
2. Cost Function
J(θ)
Example: Squared Error Function $$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2$$
Contour plots can be used to visualize $$J(\theta_{0},\theta_{1})$$ over the parameter space.
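As a concrete reference, here is a minimal NumPy sketch of this squared-error cost for the single-variable case; the function name and array arguments are illustrative, not from the course:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta) for single-variable linear regression."""
    m = len(y)                # number of training examples
    h = theta0 + theta1 * x   # hypothesis h(x) evaluated on every example
    return np.sum((h - y) ** 2) / (2 * m)
```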
3. Gradient Descent
1) Start with some θ
2) Keep changing θ until the cost function reaches a minimum
repeat until convergence:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$ (update all $$\theta_j$$ simultaneously)
α: the learning rate; it does not need to be decreased during the iterations, because the gradient itself shrinks as θ approaches a minimum
The cost function J(θ) for linear regression is always convex (bowl-shaped), so gradient descent converges to the global minimum.
Batch Gradient Descent: uses all training examples in each step
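A sketch of one batch gradient descent step for the single-variable case (again with illustrative names); note that both gradients are computed before either parameter is touched, which is exactly the simultaneous update above:

```python
import numpy as np

def gradient_descent_step(theta0, theta1, x, y, alpha):
    """One simultaneous batch update of theta0 and theta1."""
    m = len(y)
    h = theta0 + theta1 * x
    grad0 = np.sum(h - y) / m        # dJ/d(theta0), averaged over all examples
    grad1 = np.sum((h - y) * x) / m  # dJ/d(theta1)
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```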
Linear Regression for Multiple Features
1. Linear Regression in Linear Algebra Form
$$h(x)=\theta_{0}x_{0}+\theta_{1}x_{1}+\dots+\theta_{n}x_{n}=\theta^{T}x$$
where $$x_{0}=1$$ always
2. Iteration Formula
$$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})\,x_{j}^{(i)}$$
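In vectorized form this update covers all features at once. A sketch assuming a design matrix X of shape (m, n+1) whose first column is all ones; the function name and default values are my own:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for multivariate linear regression."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iterations):
        error = X @ theta - y               # h(x^(i)) - y^(i) for all i at once
        theta -= alpha / m * (X.T @ error)  # simultaneous update of every theta_j
    return theta
```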
3. Feature Scaling
Make sure the features are on a similar range so that gradient descent converges more quickly (normally roughly -1 ≤ x ≤ 1).
4. Mean Normalization
Make each feature approximately zero-mean by subtracting its average; combined with scaling: $$x_{j}:=\frac{x_{j}-\mu_{j}}{s_{j}}$$ where $$\mu_{j}$$ is the mean and $$s_{j}$$ the range (or standard deviation) of feature j.
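A sketch of this normalization, assuming X_raw is the (m, n) matrix of raw features; the bias column of ones would be appended afterwards and is never scaled:

```python
import numpy as np

def normalize_features(X_raw):
    """Return (X_norm, mu, sigma) with zero-mean, roughly unit-scale features."""
    mu = np.mean(X_raw, axis=0)    # per-feature mean
    sigma = np.std(X_raw, axis=0)  # per-feature spread (std; the range also works)
    return (X_raw - mu) / sigma, mu, sigma
```

mu and sigma are returned so the same shift and scale can be applied to new inputs at prediction time.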
5. Learning Rate and Checks That Gradient Descent Is Working Properly
1) The cost function should decrease after every iteration.
2) The curve of J(θ) against the number of iterations should flatten out only once it has converged; if it is nearly flat from the start, α is probably too small.
Example: if J(θ) increases with the iterations, the learning rate is too large.
Solution: decrease the learning rate.
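One way to run these checks is to record J(θ) at every iteration and plot it; a sketch extending the vectorized loop above (names again illustrative):

```python
import numpy as np

def train_with_history(X, y, alpha=0.01, iterations=1000):
    """Gradient descent that also records J(theta) for convergence checks."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    history = []
    for _ in range(iterations):
        error = X @ theta - y
        history.append(np.sum(error ** 2) / (2 * m))  # J(theta) before this update
        theta -= alpha / m * (X.T @ error)
    return theta, history
```

Plotting history against the iteration number should give a monotonically decreasing curve; if it rises, reduce α.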
6. Polynomial Regression
Treat powers of a feature as new features ($$x, x^{2}, x^{3}, \dots, x^{n}$$).
A new feature can be any quantity computed from the existing features (for example, a product or square root of other features).
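A sketch of building polynomial features from a single raw feature; feature scaling matters here, since $$x^{3}$$ has a much larger range than $$x$$ (function name is my own):

```python
import numpy as np

def polynomial_features(x, degree):
    """Stack x, x^2, ..., x^degree as the columns of a feature matrix."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```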
7. Normal Equation - Solving Linear Regression Analytically
1. Set the partial derivative of J(θ) with respect to every $$\theta_{j}$$ to 0.
2. Design matrix $$X=[x^{(1)}\ x^{(2)}\ \dots\ x^{(m)}]^{T}$$ where $$x^{(i)}$$ is the i-th training example (with $$x_{0}^{(i)}=1$$),
$$y=[y^{(1)}\ y^{(2)}\ \dots\ y^{(m)}]^{T}$$
then $$\theta=(X^{T}X)^{-1}X^{T}y$$
3. Disadvantage: slow when the number of features n is large, since computing $$(X^{T}X)^{-1}$$ costs roughly $$O(n^{3})$$.
4. Non-invertible situation: use pinv() (the pseudo-inverse) instead of inv(). $$X^{T}X$$ can be non-invertible when:
1) m ≤ n (too few training examples)
2) two features are linearly dependent (redundant)
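A sketch of the normal equation with NumPy; np.linalg.pinv mirrors the pinv() approach and handles both non-invertible cases gracefully:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y in closed form."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```

Unlike gradient descent, this needs no learning rate, no iterations, and no feature scaling.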