Python examples: SVM, SVR, cross-validation, and kernel functions — using LinearSVR, RBFSampler, SGDRegressor, and Nystroem
Two SVM examples on two datasets. The cancer dataset has a small sample size and a categorical label, so a plain SVC classifier suffices. The house-price dataset is large with a continuous target, so support vector regression (SVR) is used: RBFSampler and Nystroem first approximate the kernel map, then SGDRegressor fits the regression on the mapped features. All three of these functions are well suited to large samples.
I. Setup
1. Imports
In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats, integrate

sns.set_style("darkgrid")
sns.set(color_codes=True)
plt.rcParams['font.sans-serif'] = ['SimHei']  # allow Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly in plots
warnings.filterwarnings("ignore")             # suppress warnings
II. cancer data
1. Read data
In [2]:
cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\3.NB&DT\\Week3_CancerDataset.xlsx')
2. Standardize the data
In [3]:
# standardize each feature column to zero mean and unit variance
# (drop the first column, presumably an ID, and the last column, the class label)
cancer_scaled = cancer.iloc[:, 1:-1].apply(lambda x: (x - np.mean(x)) / np.std(x))
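For reference, the same standardization can be done with sklearn's StandardScaler (a minimal equivalent sketch, my own addition; StandardScaler also divides by the population standard deviation, so it matches the apply() above):

from sklearn.preprocessing import StandardScaler

# fit on the same feature columns (drop the ID-like first column and the label)
scaler = StandardScaler()
cancer_scaled_alt = pd.DataFrame(scaler.fit_transform(cancer.iloc[:, 1:-1]),
                                 columns=cancer.columns[1:-1])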
3. Train the SVM models
For each of the three common kernel functions, train an SVM and tune its hyperparameters.
(1) Linear kernel
The linear kernel does no mapping into a higher-dimensional space. Since this is a soft-margin SVM, the only parameter to tune is the penalty coefficient C for misclassified samples; C is chosen by cross-validation.
In [4]:
from sklearn.svm import SVC, SVR, LinearSVR
from sklearn.model_selection import cross_val_score

C_range = range(1, 31)
cv_scores = []
for c in C_range:
    clf = SVC(C=c, kernel='linear')
    scores = cross_val_score(clf, cancer_scaled, cancer.iloc[:, -1], cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
print('Score is maximized at C = {}: {}'.format(
    C_range[cv_scores.index(max(cv_scores))], max(cv_scores)))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores)
plt.xlabel("C")
plt.ylabel("score")
plt.title("CV score vs. penalty C")
Score is maximized at C = 6: 0.973728357060408
Out[4]:
Text(0.5, 1.0, 'CV score vs. penalty C')
(2) Gaussian (RBF) kernel
There are two hyperparameters, the penalty coefficient C and gamma, so use GridSearchCV for a grid search. Note that the scorer switches here from accuracy to the F2 score (fbeta_score with beta=2), which weights recall more heavily than precision.
In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

ftwo_scorer = make_scorer(fbeta_score, beta=2)  # F2 score: recall weighted above precision
grid1 = GridSearchCV(SVC(kernel='rbf'),
                     param_grid={'C': [1, 10, 20], 'gamma': [0.09, 0.009, 0.0009, 0.00009]},
                     scoring=ftwo_scorer, cv=5)
grid1.fit(cancer_scaled, cancer.iloc[:, -1])
print('Grid search result: best C = {}, best gamma = {}, best score = {}'.format(
    grid1.best_estimator_.get_params()['C'],
    grid1.best_estimator_.get_params()['gamma'],
    grid1.best_score_))
Grid search result: best C = 10, best gamma = 0.009, best score = 0.9882421118919705
(3) Polynomial kernel
There are again two hyperparameters, the penalty coefficient C and the polynomial degree, so GridSearchCV is used once more.
In [6]:
grid2 = GridSearchCV(SVC(kernel='poly'),
                     param_grid={'C': [1, 5, 10, 15, 20], 'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
                     scoring=ftwo_scorer, cv=5)
grid2.fit(cancer_scaled, cancer.iloc[:, -1])
print('Grid search result: best C = {}, best degree = {}, best score = {}'.format(
    grid2.best_estimator_.get_params()['C'],
    grid2.best_estimator_.get_params()['degree'],
    grid2.best_score_))
plt.scatter(x=grid2.cv_results_['param_C'].data,
            y=grid2.cv_results_['param_degree'].data,
            s=1000 * (grid2.cv_results_['mean_test_score'] - min(grid2.cv_results_['mean_test_score']) + 0.1),
            cmap=plt.cm.get_cmap('RdYlBu'))
plt.xlabel("C")
plt.ylabel("degree")
plt.title("CV score vs. penalty C and degree")
plt.annotate('circle size encodes the score', xy=(1, 1), xytext=(1, -1), color='b', size=10)
plt.annotate('optimal point', xy=(1, 1),
             xytext=(grid2.best_estimator_.get_params()['C'] + 1,
                     grid2.best_estimator_.get_params()['degree']),
             color='r')
Grid search result: best C = 20, best degree = 3, best score = 0.9867086284245016
Out[6]:
Text(21, 3, 'optimal point')
Curiously, the score alternates as the degree switches between odd and even values.
4. Write out the hyperplane
After some experimentation, only the linear-kernel SVM (which does no mapping to a higher-dimensional space) exposes an explicit hyperplane expression. Write the hyperplane as h(x) = w·x + b.
(1) Normal direction w
In [7]:
clf = SVC(kernel='linear')
clf.fit(cancer_scaled, cancer.iloc[:, -1])
print(clf.coef_)
[[-8.80113094e-02 -3.59306843e-01 -3.35013827e-01 -8.39562339e-04
   6.46330866e-01 -7.45104189e-01 -9.27479275e-01 -7.01658929e-02
   3.81723561e-01 -8.64975132e-01  3.24937431e-01 -2.37246961e-01
  -8.85892419e-01 -3.70854979e-01  3.81626684e-01  3.71302098e-01
  -4.33843461e-01  8.46263549e-02  8.92709596e-01 -6.49610348e-01
  -1.00930557e+00 -3.76030776e-01 -7.55483286e-01 -3.94390435e-01
   1.53334795e-01 -1.02528284e+00 -1.18490959e-01 -4.30099332e-01
  -8.64181126e-01]]
This vector has one component per feature: cancer.iloc[:, 1:-1] keeps 29 feature columns, so w is 29-dimensional, matching the 29 entries printed above.
(2) Intercept b
In [8]:
print(clf.intercept_)
[0.00091257]
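As a sanity check (my own addition, reusing the fitted clf above), h(x) = w·x + b can be recomputed by hand from coef_ and intercept_ and compared against sklearn's decision_function:

# manual h(x) = w·x + b for every sample
h_manual = cancer_scaled.values @ clf.coef_.ravel() + clf.intercept_[0]
# should match sklearn's decision function up to floating-point rounding
print(np.allclose(h_manual, clf.decision_function(cancer_scaled)))  # expected: True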
III. houseprice
1. Read data
In [9]:
cal_housing = pd.read_csv('C:\\Users\\91333\\Documents\\semester6\\data science\\5.SVM\\房价预测\\cal_housing.data',
                          header=None,
                          names=['longitude', 'latitude', 'Age', 'Rooms', 'Bedrooms',
                                 'population', 'households', 'Income', 'HouseValue'])
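As an aside, if the raw cal_housing.data file is not at hand, sklearn ships a version derived from the same StatLib dataset (an alternative loader, my own addition; note that its rooms, bedrooms, and occupancy columns are per-household averages and the target is in units of $100,000, so the columns are not identical to the raw file):

from sklearn.datasets import fetch_california_housing

# derived from the same StatLib cal_housing data, with averaged columns
housing = fetch_california_housing(as_frame=True)
print(housing.frame.head())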
2. A quick look
1. First 5 rows
In [10]:
cal_housing.head(5)
Out[10]:
   | longitude | latitude |  Age |  Rooms | Bedrooms | population | households | Income | HouseValue
---|-----------|----------|------|--------|----------|------------|------------|--------|-----------
 0 |   -122.23 |    37.88 | 41.0 |  880.0 |    129.0 |      322.0 |      126.0 | 8.3252 |   452600.0
 1 |   -122.22 |    37.86 | 21.0 | 7099.0 |   1106.0 |     2401.0 |     1138.0 | 8.3014 |   358500.0
 2 |   -122.24 |    37.85 | 52.0 | 1467.0 |    190.0 |      496.0 |      177.0 | 7.2574 |   352100.0
 3 |   -122.25 |    37.85 | 52.0 | 1274.0 |    235.0 |      558.0 |      219.0 | 5.6431 |   341300.0
 4 |   -122.25 |    37.85 | 52.0 | 1627.0 |    280.0 |      565.0 |      259.0 | 3.8462 |   342200.0
2. Visualization
(1) Scatter plots of longitude/latitude against house value
In [11]:
plt.figure(figsize=[17, 7])
plt.subplot(1, 2, 1)
plt.scatter(cal_housing['longitude'].values, cal_housing['HouseValue'].values, alpha=0.05)
plt.xlabel("longitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of longitude and MedianHouseValue", fontsize="x-large")
plt.subplot(1, 2, 2)
plt.scatter(cal_housing['latitude'].values, cal_housing['HouseValue'].values, alpha=0.05)
plt.xlabel("latitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of latitude and MedianHouseValue", fontsize="x-large")
Out[11]:
Text(0.5, 1.0, 'scatterplot of latitude and MedianHouseValue')
Along both longitude and latitude there are essentially two kinds of ranges. In the first, house values skew low, mostly below 200,000. In the second, every price level appears, with mid-range prices the most common: the points are densest in the middle and sparser at both extremes. This broadly matches reality: urban cores contain houses at all price levels, while rural and remote areas mostly have low-priced housing.
(2) Kernel density plot of income against house value
In [12]:
sns.jointplot(x='Income',y='HouseValue',data=cal_housing,kind='kde')
Out[12]:
<seaborn.axisgrid.JointGrid at 0x267bd149c88>
Income and house value are positively correlated, which matches everyday experience. Bivariate plots of the other variables against house value were not informative, so they are omitted.
3. Preprocessing
(1) Standardization
In [13]:
# standardize every column, including the target HouseValue
cal_housing_scaled = cal_housing.apply(lambda x: (x - np.mean(x)) / np.std(x))
(2) Principal component analysis
PCA is used here purely for dimensionality reduction, to speed up the SVR that follows.
In [14]:
from sklearn.decomposition import PCA

pca = PCA().fit(cal_housing_scaled.iloc[:, 0:-1])
print('Cumulative explained variance ratios: {}'.format(np.cumsum(pca.explained_variance_ratio_)))
Cumulative explained variance ratios: [0.48833528 0.7268287 0.86082389 0.96357371 0.98208055 0.99228857 0.99815104 1. ]
The first four principal components already explain about 96% of the variance, so keeping four components is enough.
In [15]:
# pca was already fit above, so fit_transform here refits on the same data (same result);
# keep only the first four components
pca_cal = pca.fit_transform(cal_housing_scaled.iloc[:, 0:-1])[:, 0:4]
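Equivalently (a small sketch, my own addition), PCA can be asked for four components directly, or for however many components reach a variance threshold, which avoids the fit-then-slice pattern above:

# exactly four components ...
pca4 = PCA(n_components=4)
pca_cal_alt = pca4.fit_transform(cal_housing_scaled.iloc[:, 0:-1])

# ... or enough components to explain at least 96% of the variance
pca96 = PCA(n_components=0.96)
print(pca96.fit_transform(cal_housing_scaled.iloc[:, 0:-1]).shape)  # expect 4 columns, per the ratios above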
4. Train the SVR models
(1) Linear-kernel SVR
The sklearn docs note that, for a linear kernel, LinearSVR (built on liblinear) is faster than SVR (built on libsvm). I tried it, and that is indeed the case.
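To make the speed comparison concrete, here is a minimal timing sketch (my own addition; it fits each model once on a 5,000-row subsample, since the libsvm-based SVR scales poorly with sample size):

import time

sub = cal_housing_scaled.sample(5000, random_state=0)
X_sub, y_sub = sub.iloc[:, 0:-1], sub.iloc[:, -1]

# fit the libsvm-based SVR and the liblinear-based LinearSVR once each
for model in (SVR(kernel='linear'), LinearSVR()):
    t0 = time.time()
    model.fit(X_sub, y_sub)
    print('{}: {:.1f} s'.format(type(model).__name__, time.time() - t0))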
a. After PCA
In [16]:
import time

start = time.time()
cv_scores_pca_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)  # note: the penalty actually passed to LinearSVR is 1/c,
                            # so larger c means stronger regularization
    scores = cross_val_score(svr, pca_cal, cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_pca_linear.append(scores.mean())
end = time.time()
print('Score is maximized at c = {}: {}; runtime: {} s'.format(
    C_range[cv_scores_pca_linear.index(max(cv_scores_pca_linear))],
    max(cv_scores_pca_linear), end - start))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores_pca_linear)
plt.xlabel("c")
plt.ylabel("score")
plt.title("LinearSVR CV score vs. c (after PCA)")
Score is maximized at c = 26: 0.4471613979532877; runtime: 62.31103038787842 s
Out[16]:
Text(0.5, 1.0, 'LinearSVR CV score vs. c (after PCA)')
b. Before PCA
In [17]:
start = time.time()
cv_scores_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)
    scores = cross_val_score(svr, cal_housing_scaled.iloc[:, 0:-1], cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_linear.append(scores.mean())
end = time.time()
print('Score is maximized at c = {}: {}; runtime: {} s'.format(
    C_range[cv_scores_linear.index(max(cv_scores_linear))],
    max(cv_scores_linear), end - start))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores_linear)  # fixed: originally plotted cv_scores from the cancer section
plt.xlabel("c")
plt.ylabel("score")
plt.title("LinearSVR CV score vs. c (before PCA)")
Score is maximized at c = 10: 0.5756157136922346; runtime: 84.73883938789368 s
Out[17]:
Text(0.5, 1.0, 'LinearSVR CV score vs. c (before PCA)')
Skipping PCA costs about 20 extra seconds of total runtime (84.7 s vs. 62.3 s), but the PCA-reduced data scores more than 10 percentage points lower (0.447 vs. 0.576; the score here is R², the default for regressors in cross_val_score). That loss is unacceptable, so the un-reduced data is used from here on.
(2) Gaussian (RBF) kernel
a. Kernel mapping with RBFSampler
This combines two functions: RBFSampler first maps the original data into an approximate RBF feature space, and SGDRegressor then fits the SVR on the mapped features. Per the docs, SGDRegressor with loss="epsilon_insensitive" behaves like a linear SVR; its regularization parameter alpha defaults to 0.0001, far smaller than LinearSVR's default C of 1. I was not sure at first how the two relate, but the sklearn SGD user guide suggests roughly alpha = 1 / (C * n_samples), as the sketch below illustrates. SGDRegressor is also very fast; the docs recommend it once the sample size exceeds about 10,000.
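A minimal sketch of this pairing (my own addition, not part of the original run): the kernel map and the regressor are wrapped in a Pipeline, and alpha is derived from a nominal C via the heuristic from the sklearn SGD user guide:

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

n_samples = len(cal_housing_scaled)
C = 1.0                        # nominal SVM-style penalty
alpha = 1.0 / (C * n_samples)  # heuristic: alpha = 1 / (C * n_samples)

model = make_pipeline(
    RBFSampler(gamma=10**-1.5, random_state=1),             # approximate RBF feature map
    SGDRegressor(loss="epsilon_insensitive", alpha=alpha),  # linear SVR on the mapped features
)
scores = cross_val_score(model, cal_housing_scaled.iloc[:, 0:-1],
                         cal_housing_scaled.iloc[:, -1], cv=5)
print(scores.mean())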
In [18]:
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor

start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
gamma_range = [-1, -1.5, -2, -2.5, -3]
cv_scores_RBFS = np.zeros(shape=(len(alpha_range), len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = RBFSampler(gamma=10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_RBFS[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_RBFS.shape
index = int(cv_scores_RBFS.argmax())
i, j = int(index / n), index % n  # row/column of the best score
print('Score is maximized at alpha = 10^{}, gamma = 10^{}: {}; runtime: {} s'.format(
    alpha_range[i], gamma_range[j], cv_scores_RBFS.max(), end - start))
print(cv_scores_RBFS)
Score is maximized at alpha = 10^-11, gamma = 10^-1.5: 0.5706462592558829; runtime: 100.89476466178894 s
[[ 0.12197254  0.09333182 -0.01330377 -0.08902908 -0.11553457]
 [ 0.55536832  0.5593231   0.52558767  0.46167317  0.28025331]
 [ 0.56701374  0.57042556  0.53751988  0.49345061  0.39564883]
 [ 0.5656155   0.5686655   0.53074254  0.49179496  0.39597599]
 [ 0.5672084   0.56981525  0.53735416  0.49079757  0.39544188]
 [ 0.5636954   0.57064626  0.53562283  0.49084243  0.39619231]]
b. Kernel mapping with Nystroem
In [19]:
from sklearn.kernel_approximation import Nystroem

start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
gamma_range = [-1, -1.5, -2, -2.5, -3]
cv_scores_Nyst = np.zeros(shape=(len(alpha_range), len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = Nystroem(gamma=10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_Nyst[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_Nyst.shape
index = int(cv_scores_Nyst.argmax())
i, j = int(index / n), index % n
print('Score is maximized at alpha = 10^{}, gamma = 10^{}: {}; runtime: {} s'.format(
    alpha_range[i], gamma_range[j], cv_scores_Nyst.max(), end - start))
print(cv_scores_Nyst)
Score is maximized at alpha = 10^-5, gamma = 10^-1: 0.590604308123932; runtime: 100.88983869552612 s
[[ 0.14738545  0.11121301 -0.00634036 -0.08838434 -0.11977817]
 [ 0.58485212  0.57181939  0.51956549  0.44955772  0.29771895]
 [ 0.59060431  0.5851441   0.52589562  0.48105512  0.39730037]
 [ 0.5878266   0.58427723  0.52762529  0.47229848  0.3941709 ]
 [ 0.58944038  0.58054867  0.52827467  0.47057277  0.38794852]
 [ 0.58958172  0.58522741  0.53486099  0.47879231  0.39311045]]
The docs say Nystroem is somewhat slower than RBFSampler but more accurate. Here the two runtimes came out nearly identical and Nystroem did score higher, so the results are broadly consistent with that claim.
(3) Polynomial kernel
In [20]:
start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
degree_range = [1, 2, 3, 4, 5, 6, 7, 8, 9]
cv_scores_poly = np.zeros(shape=(len(alpha_range), len(degree_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(degree_range)):
        # NOTE: Nystroem defaults to kernel='rbf', so this `degree` argument is
        # silently ignored; see the corrected sketch after the output below
        rbf_feature = Nystroem(degree=degree_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_poly[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_poly.shape
index = int(cv_scores_poly.argmax())
i, j = int(index / n), index % n
print('Score is maximized at alpha = 10^{}, degree = {}: {}; runtime: {} s'.format(
    alpha_range[i], degree_range[j], cv_scores_poly.max(), end - start))
print(cv_scores_poly)
Score is maximized at alpha = 10^-5, degree = 3: 0.5836778542094929; runtime: 245.93922114372253 s
[[0.1387308  0.14087102 0.13907231 0.14932641 0.14683597 0.14708831 0.14242401 0.13844941 0.14060674]
 [0.56763798 0.56812605 0.56974842 0.56641043 0.56690536 0.57121851 0.56881908 0.56988619 0.56971767]
 [0.57845136 0.57844851 0.58367785 0.57892279 0.58108542 0.58091855 0.57843517 0.58109156 0.57894842]
 [0.58244182 0.57824064 0.57810287 0.58273207 0.57906801 0.5822089  0.57929546 0.57593165 0.58037378]
 [0.5791144  0.5810481  0.57533389 0.58261548 0.58022263 0.57767742 0.58082774 0.58168407 0.5816219 ]
 [0.57711384 0.57705246 0.58137879 0.57963625 0.57937285 0.58090316 0.58025604 0.5781358  0.57887552]]
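A caveat on the cell above: because Nystroem approximates an RBF kernel by default, the degree grid had no effect, which is consistent with how little the scores vary across degrees within each row. A minimal corrected sketch (my own addition; rerunning it would of course produce different numbers from the table above):

# request an actual polynomial kernel; with the default kernel='rbf',
# the `degree` argument is ignored
poly_feature = Nystroem(kernel='poly', degree=3, random_state=1)
X_features = poly_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
print(X_features.shape)  # n_components defaults to 100 mapped features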