Machine Learning Notes ---- Introduction + Linear Regression

Introduction

1. Definition of Machine Learning

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

2. Supervised & Unsupervised Learning

Supervised Learning: the correct answer (label) is given for each training example

1) Regression: the output is continuous

2) Classification: the output is discrete

Unsupervised Learning: the answer is not known; only the data is available

1) Clustering

2) Non-clustering

Example: the "Cocktail Party Algorithm" (separating mixed audio sources)

Model, Cost Function and Gradient Descent

Some definitions:

m: size of the training set (number of training examples)

n: number of input features

x: input

y: output

$$(x^{(i)},y^{(i)})$$

denotes the i-th training example; the j-th feature of its input is $$x_j^{(i)}$$ and its output is $$y^{(i)}$$

hypothesis: the target function $$h$$ that maps inputs to predicted outputs


1. Linear Regression

$$h(x)=\theta_{0}+\theta_{1}x$$

θ: parameters

2. Cost Function

J(θ)

Example: Squared Error Function $$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2$$
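As a quick illustration, here is a minimal NumPy sketch of this cost function (the name `compute_cost` and the toy data are my own, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    h = theta0 + theta1 * x              # predictions for all m examples
    return np.sum((h - y) ** 2) / (2 * m)

# Toy data lying exactly on y = 2x, so (theta0, theta1) = (0, 2) gives J = 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 1.0, x, y))  # ~2.33: worse fit, higher cost
```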

Contour Plot: a way to visualize $$J(\theta_0,\theta_1)$$

3. Gradient Descent

1) Start with some initial θ

2) Keep changing θ until the cost function reaches a minimum

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$   (simultaneous update: compute the new value of every $$\theta_j$$ before overwriting any of them)

α: learning rate; it does not need to change during the iteration (the gradient term itself shrinks as θ approaches the minimum)

The cost function J(θ) for linear regression is always convex (bowl-shaped), so it has a single global minimum!

Batch Gradient Descent: uses all training examples in each step (see the sketch below)
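A minimal sketch of batch gradient descent for one-variable linear regression (the names and toy data are my own); note the temporary variables that implement the simultaneous update:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Each step uses ALL m examples (batch); theta0 and theta1
    are updated simultaneously via temporaries."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                            # current predictions
        temp0 = theta0 - alpha * np.sum(h - y) / m         # partial w.r.t. theta0
        temp1 = theta1 - alpha * np.sum((h - y) * x) / m   # partial w.r.t. theta1
        theta0, theta1 = temp0, temp1   # overwrite only after both are computed
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(x, y))  # converges toward (0.0, 2.0)
```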


Linear Regression for Multiple Features


1. Linear Regression in Linear Algebra Form

$$h(x) = \theta_{0}x_{0}+\theta_{1}x_{1}+\cdots+\theta_{n}x_{n} = \theta^{T}x$$

where $$x_{0}=1$$ always

2. Iteration Formula

$$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
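The same update in vectorized form; a hedged sketch (function name is mine) assuming the design matrix X already contains the all-ones $$x_0$$ column:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=5000):
    """Vectorized batch update: every theta_j moves together.
    X is m x (n+1) with its first column all ones (x0 = 1)."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) * sum((h - y) * x_j) for every j
        theta = theta - alpha * gradient       # simultaneous update of the whole vector
    return theta

# x0 = 1 column plus one feature; data fit y = 1 + 2x exactly
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(gradient_descent_multi(X, y, alpha=0.1, num_iters=2000))  # approx [1. 2.]
```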

3. Feature Scaling

Make sure the features are on a similar range so that gradient descent converges more quickly (normally roughly $$-1\leq x\leq 1$$).

4. Mean Normalization

Make the features zero on average (subtract each feature's mean); see the combined sketch below.
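A combined sketch of mean normalization plus scaling, dividing by the standard deviation as one common choice (the helper name is mine):

```python
import numpy as np

def normalize_features(X):
    """Per column: subtract the mean (mean normalization) and divide
    by the standard deviation (feature scaling)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
Xn, mu, sigma = normalize_features(X)
print(Xn)  # each column now has mean 0 and unit spread
```

The same `mu` and `sigma` must be reused to transform any new input before predicting.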

5. Learning Rate and Checks That Gradient Descent Is Working Properly

1) The cost function should decrease after every iteration.

2) The curve of $$J(\theta)$$ versus the number of iterations should not be too flat; a nearly flat curve means α is too small and convergence is slow.

   Example: $$J(\theta)$$ increases with the iterations (α is too large).

   Solution: decrease the learning rate.
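A sketch of this monitoring wrapped around the vectorized update above (the early-stop check is my own simplification):

```python
import numpy as np

def monitored_descent(X, y, alpha, num_iters=500):
    """Record J(theta) every iteration: a flattening curve means
    convergence; a rising J means alpha is too large."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
        if len(history) > 1 and history[-1] > history[-2]:
            print("J(theta) increased -- decrease the learning rate")
            break
    return theta, history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
theta, history = monitored_descent(X, y, alpha=0.1)
```

Plotting `history` against the iteration number gives exactly the diagnostic curve described above.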

6. Polynomial Regression

Treat powers of a feature as new features ($$x, x^{2}, x^{3}, \dots, x^{n}$$), as in the sketch below.

A new feature can be any quantity computed from the other features.
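A sketch of building such polynomial features (the helper name is mine):

```python
import numpy as np

def polynomial_features(x, degree):
    """Turn a single feature x into the columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
print(polynomial_features(x, 3))
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

Because $$x^{2}$$ and $$x^{3}$$ span much larger ranges than $$x$$, feature scaling becomes especially important here.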

7. Normal Equation - Solving Linear Regression Analytically

1. Set the partial derivative of $$J(\theta)$$ with respect to every $$\theta_j$$ to 0.

2. Build the design matrix $$X=\begin{bmatrix}(x^{(1)})^{T}\\(x^{(2)})^{T}\\\vdots\\(x^{(m)})^{T}\end{bmatrix}$$ (one training example per row), where $$x^{(i)}$$ is the i-th training example, and $$y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}^{T}$$;

   then $$\theta=(X^{T}X)^{-1}X^{T}y$$

3. Disadvantage: slow when n is large (computing $$(X^{T}X)^{-1}$$ costs roughly $$O(n^{3})$$)

4. Non-invertible situation: use pinv() (the pseudoinverse matrix); see the sketch below. Common causes:

1) m ≤ n (no more training examples than features)

2) two features that are linearly dependent (redundant features)
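A sketch of the normal equation in NumPy; `np.linalg.pinv` computes the pseudoinverse, which covers the non-invertible cases above (the data are my own toy example):

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X^T X) X^T y; pinv handles singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# x0 = 1 column plus one feature; the data fit y = 1 + 2x exactly
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(normal_equation(X, y))  # approximately [1. 2.]
```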