Scikit-Learn and TensorFlow (Aurélien Géron, 2017) Study Notes, Chapter 4

Training Models


1. Linear Regression

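The key equations for this section, restated from the book since the original images are not reproduced here: a Linear Regression model makes a prediction by computing a weighted sum of the input features plus a bias term, and it is trained by minimizing the MSE cost function.

ŷ = θ₀ + θ₁·x₁ + θ₂·x₂ + … + θₙ·xₙ = θᵀ·x

MSE(X, h_θ) = (1/m) · Σ_{i=1..m} (θᵀ·x⁽ⁱ⁾ − y⁽ⁱ⁾)²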


2. The Normal Equation



Functions: np.linalg.inv(), LinearRegression()
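The Normal Equation gives the value of θ that minimizes the MSE in closed form: θ̂ = (Xᵀ·X)⁻¹ · Xᵀ · y. A minimal sketch, following the book's example, of computing it with NumPy and getting the same result from Scikit-Learn's LinearRegression (the generated data here is just for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Generate linear-looking data: y = 4 + 3x + Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Normal Equation

lin_reg = LinearRegression()  # same solution via Scikit-Learn
lin_reg.fit(X, y)
print(theta_best.ravel(), lin_reg.intercept_, lin_reg.coef_)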

3. Computational Complexity


The Normal Equation computes the inverse of XᵀX, an n × n matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), so the Normal Equation gets very slow when the number of features grows large.

On the positive side, this equation is linear with regard to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory. Predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) will take roughly twice as much time.

4. Gradient Descent

The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.


The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly.

These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).

(1) Batch Gradient Descent

Batch Gradient Descent: it uses the whole batch of training data at every step

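At each step, Batch Gradient Descent moves θ in the direction opposite to the gradient of the MSE, scaled by the learning rate η: θ(next step) = θ − η·∇_θ MSE(θ). A minimal NumPy sketch (assuming the X_b and y arrays from the Normal Equation example above):

eta = 0.1           # learning rate
n_iterations = 1000
m = 100             # number of training instances

theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE over the full batch
    theta = theta - eta * gradients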


To find a good learning rate, you can use grid search. To decide on the number of iterations, a simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance), because this happens when Gradient Descent has (almost) reached the minimum.


(2) Stochastic Gradient Descent

Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm). On the other hand, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. The final parameter values are good, but not optimal.



The algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing (simulated annealing accepts a worse solution than the current one with some probability, which makes it possible to jump out of a local optimum and reach the global optimum).

Functions: SGDRegressor()
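A minimal sketch with Scikit-Learn's SGDRegressor (the hyperparameter values are illustrative; the 2017 edition of the book uses n_iter=50, eta0=0.1 and penalty=None, while newer Scikit-Learn versions use max_iter). The target must be passed as a 1D array:

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)  # plain SGD, no regularization
sgd_reg.fit(X, y.ravel())
print(sgd_reg.intercept_, sgd_reg.coef_)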


(3) Mini-batch Gradient Descent

Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.



5. Polynomial Regression


A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended
set of features.

Functions: PolynomialFeatures(), fit_transform(), LinearRegression()

PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features.
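A minimal sketch (assuming some single-feature, roughly quadratic data X, y): PolynomialFeatures adds the squared feature, and a plain LinearRegression is then fit on the extended feature set.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)  # each instance now contains [x, x^2]
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)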


6. Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression.

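Learning curves plot the model's performance on the training set and on the validation set as a function of the training set size. A rough sketch of the book's plot_learning_curves helper (plot styling simplified):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])            # train on the first m instances only
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")  # RMSE on the training subset
    plt.plot(np.sqrt(val_errors), "b-", label="val")       # RMSE on the full validation set
    plt.legend()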

7. The Bias/Variance Tradeoff

Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.



Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.

Having too many variables (features) while having only very little training data leads to overfitting.

Approach 1: reduce the number of selected variables, keeping only the most important features.

Approach 2: regularization. Keep all the features, but shrink the magnitude of the feature parameters (the values of the parameters θ(j)).


8. Regularized Linear Models

Ridge Regression and Lasso Regression were introduced to solve two problems with plain Linear Regression: overfitting, and the case where XᵀX is not invertible when solving for θ via the Normal Equation.

(1) Ridge Regression (also called Tikhonov regularization; it uses the ℓ2 norm)


This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
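The Ridge Regression cost function, as defined in the book (the hyperparameter α controls how much the model is regularized):

J(θ) = MSE(θ) + α · (1/2) · Σ_{i=1..n} θᵢ²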


Note that the bias term θ₀ is not regularized (the sum starts at i = 1, not 0).


Functions: Ridge(), SGDRegressor()
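A minimal sketch of both options: Ridge Regression with a closed-form solver, and plain SGD with an ℓ2 penalty (alpha=1 and the Cholesky solver follow the book's example; X, y are assumed to be the training data from earlier):

from sklearn.linear_model import Ridge, SGDRegressor

ridge_reg = Ridge(alpha=1, solver="cholesky")  # closed-form Ridge solution
ridge_reg.fit(X, y)

sgd_reg = SGDRegressor(penalty="l2")           # SGD with an l2 penalty term added to the cost
sgd_reg.fit(X, y.ravel())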

(2) Lasso Regression (uses the ℓ1 norm)

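The Lasso Regression cost function, per the book, uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm:

J(θ) = MSE(θ) + α · Σ_{i=1..n} |θᵢ|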

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).


Functions: Lasso()
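A minimal sketch (alpha=0.1 follows the book's example; the single-feature X, y data from earlier is assumed). An SGDRegressor(penalty="l1") behaves similarly:

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))  # prediction for a new instance with one feature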


(3) Elastic Net

The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r.

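The Elastic Net cost function, per the book; when r = 0 it is equivalent to Ridge Regression, and when r = 1 it is equivalent to Lasso Regression:

J(θ) = MSE(θ) + r·α·Σ_{i=1..n} |θᵢ| + ((1 − r)/2)·α·Σ_{i=1..n} θᵢ²

A minimal usage sketch (l1_ratio corresponds to the mix ratio r; the values are illustrative):

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)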

(4) Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.


With early stopping you just stop training as soon as the validation error reaches the minimum.
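A rough sketch of early stopping with SGDRegressor, in the spirit of the book's example: warm_start=True means each call to fit() continues training where it left off. The scaled training/validation sets (X_train_scaled, X_val_scaled, y_train, y_val) and the hyperparameter values are assumptions for illustration.

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(1000):
    sgd_reg.fit(X_train_scaled, y_train.ravel())   # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:              # keep the model with the lowest validation error
        minimum_val_error = val_error
        best_epoch, best_model = epoch, deepcopy(sgd_reg)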

9. Logistic Regression

(1) Estimating Probabilities

A Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.

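The corresponding equations, restated from the book: the estimated probability, the logistic (sigmoid) function, and the resulting prediction rule.

p̂ = h_θ(x) = σ(θᵀ·x)

σ(t) = 1 / (1 + exp(−t))

ŷ = 0 if p̂ < 0.5,   ŷ = 1 if p̂ ≥ 0.5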


(2) Training and Cost Function

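The cost function, restated from the book: for a single instance the cost is −log(p̂) if y = 1, and −log(1 − p̂) if y = 0; averaged over the whole training set this gives the log loss:

J(θ) = −(1/m) · Σ_{i=1..m} [ y⁽ⁱ⁾·log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − p̂⁽ⁱ⁾) ]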

This cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough).


(3) Decision Boundaries

Functions: load_iris(), LogisticRegression(), predict_proba()

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature.
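A minimal sketch following the book's example (in the iris dataset, class 2 is Iris-Virginica, and the petal width is column 3 of the feature matrix):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                # petal width (cm)
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7], [1.5]]))  # estimated probabilities for two petal widths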


(4) Softmax Regression (Multinomial Logistic Regression)

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. When given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.

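The main equations, restated from the book: the softmax score for class k, the softmax function that turns the scores into probabilities, the prediction, and the cross entropy cost function (y_k⁽ⁱ⁾ is 1 if the target class of the i-th instance is k, and 0 otherwise):

s_k(x) = θ_kᵀ·x

p̂_k = σ(s(x))_k = exp(s_k(x)) / Σ_{j=1..K} exp(s_j(x))

ŷ = argmax_k s_k(x)

J(Θ) = −(1/m) · Σ_{i=1..m} Σ_{k=1..K} y_k⁽ⁱ⁾ · log(p̂_k⁽ⁱ⁾)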

Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression cost function (the log loss).


You can then use Gradient Descent (or any other optimization algorithm) to find the parameter matrix Θ that minimizes the cost function.

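A minimal sketch following the book, continuing from the iris example above: Scikit-Learn's LogisticRegression switches to Softmax Regression when multi_class="multinomial" is set together with a compatible solver such as "lbfgs" (C=10 is the book's choice of regularization strength):

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
print(softmax_reg.predict([[5, 2]]))        # predicted class
print(softmax_reg.predict_proba([[5, 2]]))  # per-class probabilities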