SKLearn Model Selection: Model Validation Methods

I. Computing scores via cross-validation
model_selection.cross_val_score(estimator, X)
1. estimator: a learner implementing fit
2. X: array-like, the data to learn from; may be a list or a 2-D array
3. y: array-like, optional, default None; the true target values of the samples, for supervised learning
4. scoring: string, callable or None, optional, default None
A string, or a scorer callable object/function implementing scorer(estimator, X, y)
5. cv: int, cross-validation generator, or an iterable, optional, default None; determines the cross-validation splitting strategy. Possible inputs for cv:
(1) None: use the default 3-fold cross-validation (newer scikit-learn releases default to 5-fold)
(2) Integer: the number of folds in a (Stratified)KFold
(3) An object usable as a cross-validation generator
(4) An iterable yielding train/test splits
For integer/None inputs, if the estimator is a classifier and y is the corresponding class labels, StratifiedKFold is used; in all other cases, KFold is used.
Return value: scores, an array of floats, shape=(len(list(cv)),), collecting the score of each cross-validation run into one array. (With the default of three CV runs, there are three scores.)
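The cv options above can be sketched with a minimal example (the iris dataset, linear-kernel SVC, and 5-fold split here are illustrative assumptions, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear')

# cv as an integer: StratifiedKFold is used because clf is a classifier
scores_int = cross_val_score(clf, X, y, cv=5)

# cv as a cross-validation generator object
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores_gen = cross_val_score(clf, X, y, cv=kf)

print(scores_int.shape)  # (5,) -- one score per fold
```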
Example:
# Plot the cross-validation curve of an SVM on the digits dataset
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
print('parameter list length:', len(C_s))
scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=4)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Plot the cross-validation curve
import matplotlib.pyplot as plt
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: '%g' % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
II. Producing a cross-validated estimate for each input data point
model_selection.cross_val_predict(estimator, X)
sklearn.model_selection.cross_val_predict(estimator, X, y=None, groups=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', method='predict')
Parameter method: string, optional, default 'predict'; the name of the estimator method to invoke.
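A minimal sketch of cross_val_predict: each sample's prediction comes from the fold in which that sample was held out (the digits dataset, linear-kernel SVC, and 3-fold split are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# One out-of-fold prediction per input sample
y_pred = cross_val_predict(SVC(kernel='linear'), X, y, cv=3)

print(y_pred.shape)               # same length as y
print(accuracy_score(y, y_pred))  # aggregate out-of-fold accuracy
```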
III. Computing and plotting a model's learning curve
model_selection.learning_curve(estimator, X, y)
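Before the full plotting example below, the return values of learning_curve itself can be sketched directly (the GaussianNB learner, 5-fold CV, and three train sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 3))

# One row per train size, one column per CV fold
print(train_sizes)         # absolute numbers of training examples
print(train_scores.shape)  # (3, 5)
print(test_scores.shape)   # (3, 5)
```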
# Plot the cross-validated learning curve of a learner
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
    plt.legend(loc='best')
    return plt

digits = load_digits()
X, y = digits.data, digits.target

title = 'Learning Curves (Naive Bayes)'
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

title = r'Learning Curves (SVM, RBF kernel, $\gamma=0.001$)'
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
plt.show()
IV. Computing and plotting a model's validation curve
model_selection.validation_curve(estimator, ...)
# Plot the validation curve of an SVM learner
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

digits = load_digits()
X, y = digits.data, digits.target
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=10, scoring='accuracy', n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title('Validation Curve with SVM')
plt.xlabel(r'$\gamma$')
plt.ylabel('Score')
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(param_range, train_scores_mean, label='Training score',
             color='darkorange', lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color='darkorange', lw=lw)
plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
             color='navy', lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color='navy', lw=lw)
plt.legend(loc='best')
plt.show()
V. Evaluating the significance of a cross-validated score with permutations
model_selection.permutation_test_score(...)
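A minimal sketch of permutation_test_score: the score on the true labels is compared against scores obtained on randomly permuted labels, yielding a p-value (the iris dataset, linear-kernel SVC, and 30 permutations are illustrative assumptions; the default is 100 permutations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    SVC(kernel='linear'), X, y, cv=5, n_permutations=30, random_state=0)

print(score)              # accuracy on the true labels
print(perm_scores.shape)  # (30,) -- one score per label permutation
print(pvalue)             # small p-value => labels carry real signal
```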