Python examples: SVM, SVR, cross-validation, and kernel functions — using LinearSVR, RBFSampler, SGDRegressor, and Nystroem
Two SVM examples on two datasets. The cancer dataset has a small sample size and a categorical label, so a plain SVC classifier suffices. The house-price dataset is large with a continuous target, so support vector regression (SVR) is used: RBFSampler and Nystroem first approximate the kernel map, then SGDRegressor fits the regression on the mapped features. All three of these functions are well suited to large samples.
I. Setup
1. Imports
In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats, integrate

sns.set_style("darkgrid")
sns.set(color_codes=True)
plt.rcParams['font.sans-serif'] = ['SimHei']  # allow Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly in plots
warnings.filterwarnings("ignore")             # suppress warnings
II. cancer data
1. Read data
In [2]:
cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\3.NB&DT\\Week3_CancerDataset.xlsx')
2. Standardize the data
In [3]:
# standardize each feature column to zero mean and unit variance
# (drop the first column, presumably an ID, and the last column, the class label)
cancer_scaled = cancer.iloc[:, 1:-1].apply(lambda x: (x - np.mean(x)) / np.std(x))
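For reference, the same standardization can be done with sklearn's StandardScaler (a minimal equivalent sketch, my own addition; StandardScaler also divides by the population standard deviation, so it matches the apply() above):

from sklearn.preprocessing import StandardScaler

# fit on the same feature columns (drop the ID-like first column and the label)
scaler = StandardScaler()
cancer_scaled_alt = pd.DataFrame(scaler.fit_transform(cancer.iloc[:, 1:-1]),
                                 columns=cancer.columns[1:-1])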
3. Train the SVM models
For each of the three common kernel functions, train an SVM and tune its hyperparameters.
(1) Linear kernel
The linear kernel does no mapping into a higher-dimensional space. Since this is a soft-margin SVM, the only parameter to tune is the penalty coefficient C for misclassified samples; C is chosen by cross-validation.
In [4]:
from sklearn.svm import SVC, SVR, LinearSVR
from sklearn.model_selection import cross_val_score

C_range = range(1, 31)
cv_scores = []
for c in C_range:
    clf = SVC(C=c, kernel='linear')
    scores = cross_val_score(clf, cancer_scaled, cancer.iloc[:, -1], cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
print('Score is maximized at C = {}: {}'.format(
    C_range[cv_scores.index(max(cv_scores))], max(cv_scores)))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores)
plt.xlabel("C")
plt.ylabel("score")
plt.title("CV score vs. penalty C")
Score is maximized at C = 6: 0.973728357060408
Out[4]:
Text(0.5, 1.0, 'CV score vs. penalty C')
(2) Gaussian (RBF) kernel
There are two hyperparameters, the penalty coefficient C and gamma, so use GridSearchCV for a grid search. Note that the scorer switches here from accuracy to the F2 score (fbeta_score with beta=2), which weights recall more heavily than precision.
In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

ftwo_scorer = make_scorer(fbeta_score, beta=2)  # F2 score: recall weighted above precision
grid1 = GridSearchCV(SVC(kernel='rbf'),
                     param_grid={'C': [1, 10, 20], 'gamma': [0.09, 0.009, 0.0009, 0.00009]},
                     scoring=ftwo_scorer, cv=5)
grid1.fit(cancer_scaled, cancer.iloc[:, -1])
print('Grid search result: best C = {}, best gamma = {}, best score = {}'.format(
    grid1.best_estimator_.get_params()['C'],
    grid1.best_estimator_.get_params()['gamma'],
    grid1.best_score_))
Grid search result: best C = 10, best gamma = 0.009, best score = 0.9882421118919705
(3) Polynomial kernel
There are again two hyperparameters, the penalty coefficient C and the polynomial degree, so GridSearchCV is used once more.
In [6]:
grid2 = GridSearchCV(SVC(kernel='poly'),
                     param_grid={'C': [1, 5, 10, 15, 20], 'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
                     scoring=ftwo_scorer, cv=5)
grid2.fit(cancer_scaled, cancer.iloc[:, -1])
print('Grid search result: best C = {}, best degree = {}, best score = {}'.format(
    grid2.best_estimator_.get_params()['C'],
    grid2.best_estimator_.get_params()['degree'],
    grid2.best_score_))
plt.scatter(x=grid2.cv_results_['param_C'].data,
            y=grid2.cv_results_['param_degree'].data,
            s=1000 * (grid2.cv_results_['mean_test_score'] - min(grid2.cv_results_['mean_test_score']) + 0.1),
            cmap=plt.cm.get_cmap('RdYlBu'))
plt.xlabel("C")
plt.ylabel("degree")
plt.title("CV score vs. penalty C and degree")
plt.annotate('circle size encodes the score', xy=(1, 1), xytext=(1, -1), color='b', size=10)
plt.annotate('optimal point', xy=(1, 1),
             xytext=(grid2.best_estimator_.get_params()['C'] + 1,
                     grid2.best_estimator_.get_params()['degree']),
             color='r')
Grid search result: best C = 20, best degree = 3, best score = 0.9867086284245016
Out[6]:
Text(21, 3, 'optimal point')
Curiously, the score alternates as the degree switches between odd and even values.
4. Write out the hyperplane
After some experimentation, only the linear-kernel SVM (which does no mapping to a higher-dimensional space) exposes an explicit hyperplane expression. Write the hyperplane as h(x) = w·x + b.
(1) Normal direction w
In [7]:
clf = SVC(kernel='linear')
clf.fit(cancer_scaled, cancer.iloc[:, -1])
print(clf.coef_)
[[-8.80113094e-02 -3.59306843e-01 -3.35013827e-01 -8.39562339e-04
   6.46330866e-01 -7.45104189e-01 -9.27479275e-01 -7.01658929e-02
   3.81723561e-01 -8.64975132e-01  3.24937431e-01 -2.37246961e-01
  -8.85892419e-01 -3.70854979e-01  3.81626684e-01  3.71302098e-01
  -4.33843461e-01  8.46263549e-02  8.92709596e-01 -6.49610348e-01
  -1.00930557e+00 -3.76030776e-01 -7.55483286e-01 -3.94390435e-01
   1.53334795e-01 -1.02528284e+00 -1.18490959e-01 -4.30099332e-01
  -8.64181126e-01]]
This vector has one component per feature: cancer.iloc[:, 1:-1] keeps 29 feature columns, so w is 29-dimensional, matching the 29 entries printed above.
(2) Intercept b
In [8]:
print(clf.intercept_)
[0.00091257]
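As a sanity check (my own addition, reusing the fitted clf above), h(x) = w·x + b can be recomputed by hand from coef_ and intercept_ and compared against sklearn's decision_function:

# manual h(x) = w·x + b for every sample
h_manual = cancer_scaled.values @ clf.coef_.ravel() + clf.intercept_[0]
# should match sklearn's decision function up to floating-point rounding
print(np.allclose(h_manual, clf.decision_function(cancer_scaled)))  # expected: True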
III. houseprice
1. Read data
In [9]:
cal_housing = pd.read_csv('C:\\Users\\91333\\Documents\\semester6\\data science\\5.SVM\\房价预测\\cal_housing.data',
                          header=None,
                          names=['longitude', 'latitude', 'Age', 'Rooms', 'Bedrooms',
                                 'population', 'households', 'Income', 'HouseValue'])
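As an aside, if the raw cal_housing.data file is not at hand, sklearn ships a version derived from the same StatLib dataset (an alternative loader, my own addition; note that its rooms, bedrooms, and occupancy columns are per-household averages and the target is in units of $100,000, so the columns are not identical to the raw file):

from sklearn.datasets import fetch_california_housing

# derived from the same StatLib cal_housing data, with averaged columns
housing = fetch_california_housing(as_frame=True)
print(housing.frame.head())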
2. A quick look
1. First 5 rows
In [10]:
cal_housing.head(5)
Out[10]:
   | longitude | latitude |  Age |  Rooms | Bedrooms | population | households | Income | HouseValue
---|-----------|----------|------|--------|----------|------------|------------|--------|-----------
 0 |   -122.23 |    37.88 | 41.0 |  880.0 |    129.0 |      322.0 |      126.0 | 8.3252 |   452600.0
 1 |   -122.22 |    37.86 | 21.0 | 7099.0 |   1106.0 |     2401.0 |     1138.0 | 8.3014 |   358500.0
 2 |   -122.24 |    37.85 | 52.0 | 1467.0 |    190.0 |      496.0 |      177.0 | 7.2574 |   352100.0
 3 |   -122.25 |    37.85 | 52.0 | 1274.0 |    235.0 |      558.0 |      219.0 | 5.6431 |   341300.0
 4 |   -122.25 |    37.85 | 52.0 | 1627.0 |    280.0 |      565.0 |      259.0 | 3.8462 |   342200.0
2. Visualization
(1) Scatter plots of longitude/latitude against house value
In [11]:
plt.figure(figsize=[17, 7])
plt.subplot(1, 2, 1)
plt.scatter(cal_housing['longitude'].values, cal_housing['HouseValue'].values, alpha=0.05)
plt.xlabel("longitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of longitude and MedianHouseValue", fontsize="x-large")
plt.subplot(1, 2, 2)
plt.scatter(cal_housing['latitude'].values, cal_housing['HouseValue'].values, alpha=0.05)
plt.xlabel("latitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of latitude and MedianHouseValue", fontsize="x-large")
Out[11]:
Text(0.5, 1.0, 'scatterplot of latitude and MedianHouseValue')
Along both longitude and latitude there are essentially two kinds of ranges. In the first, house values skew low, mostly below 200,000. In the second, every price level appears, with mid-range prices the most common: the points are densest in the middle and sparser at both extremes. This broadly matches reality: urban cores contain houses at all price levels, while rural and remote areas mostly have low-priced housing.
(2) Kernel density plot of income against house value
In [12]:
sns.jointplot(x='Income',y='HouseValue',data=cal_housing,kind='kde')
Out[12]:
<seaborn.axisgrid.JointGrid at 0x267bd149c88>
Income and house value are positively correlated, which matches everyday experience. Bivariate plots of the other variables against house value were not informative, so they are omitted.
3. Preprocessing
(1) Standardization
In [13]:
# standardize every column, including the target HouseValue
cal_housing_scaled = cal_housing.apply(lambda x: (x - np.mean(x)) / np.std(x))
(2) Principal component analysis
PCA is used here purely for dimensionality reduction, to speed up the SVR that follows.
In [14]:
from sklearn.decomposition import PCA

pca = PCA().fit(cal_housing_scaled.iloc[:, 0:-1])
print('Cumulative explained variance ratios: {}'.format(np.cumsum(pca.explained_variance_ratio_)))
Cumulative explained variance ratios: [0.48833528 0.7268287 0.86082389 0.96357371 0.98208055 0.99228857 0.99815104 1. ]
The first four principal components already explain about 96% of the variance, so keeping four components is enough.
In [15]:
# pca was already fit above, so fit_transform here refits on the same data (same result);
# keep only the first four components
pca_cal = pca.fit_transform(cal_housing_scaled.iloc[:, 0:-1])[:, 0:4]
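Equivalently (a small sketch, my own addition), PCA can be asked for four components directly, or for however many components reach a variance threshold, which avoids the fit-then-slice pattern above:

# exactly four components ...
pca4 = PCA(n_components=4)
pca_cal_alt = pca4.fit_transform(cal_housing_scaled.iloc[:, 0:-1])

# ... or enough components to explain at least 96% of the variance
pca96 = PCA(n_components=0.96)
print(pca96.fit_transform(cal_housing_scaled.iloc[:, 0:-1]).shape)  # expect 4 columns, per the ratios above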
4. Train the SVR models
(1) Linear-kernel SVR
The sklearn docs note that, for a linear kernel, LinearSVR (built on liblinear) is faster than SVR (built on libsvm). I tried it, and that is indeed the case.
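To make the speed comparison concrete, here is a minimal timing sketch (my own addition; it fits each model once on a 5,000-row subsample, since the libsvm-based SVR scales poorly with sample size):

import time

sub = cal_housing_scaled.sample(5000, random_state=0)
X_sub, y_sub = sub.iloc[:, 0:-1], sub.iloc[:, -1]

# fit the libsvm-based SVR and the liblinear-based LinearSVR once each
for model in (SVR(kernel='linear'), LinearSVR()):
    t0 = time.time()
    model.fit(X_sub, y_sub)
    print('{}: {:.1f} s'.format(type(model).__name__, time.time() - t0))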
a. After PCA
In [16]:
import time

start = time.time()
cv_scores_pca_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)  # note: the penalty actually passed to LinearSVR is 1/c,
                            # so larger c means stronger regularization
    scores = cross_val_score(svr, pca_cal, cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_pca_linear.append(scores.mean())
end = time.time()
print('Score is maximized at c = {}: {}; runtime: {} s'.format(
    C_range[cv_scores_pca_linear.index(max(cv_scores_pca_linear))],
    max(cv_scores_pca_linear), end - start))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores_pca_linear)
plt.xlabel("c")
plt.ylabel("score")
plt.title("LinearSVR CV score vs. c (after PCA)")
Score is maximized at c = 26: 0.4471613979532877; runtime: 62.31103038787842 s
Out[16]:
Text(0.5, 1.0, 'LinearSVR CV score vs. c (after PCA)')
b. Before PCA
In [17]:
start = time.time()
cv_scores_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)
    scores = cross_val_score(svr, cal_housing_scaled.iloc[:, 0:-1], cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_linear.append(scores.mean())
end = time.time()
print('Score is maximized at c = {}: {}; runtime: {} s'.format(
    C_range[cv_scores_linear.index(max(cv_scores_linear))],
    max(cv_scores_linear), end - start))
plt.figure(figsize=(10, 4))
plt.plot(C_range, cv_scores_linear)  # fixed: originally plotted cv_scores from the cancer section
plt.xlabel("c")
plt.ylabel("score")
plt.title("LinearSVR CV score vs. c (before PCA)")
Score is maximized at c = 10: 0.5756157136922346; runtime: 84.73883938789368 s
Out[17]:
Text(0.5, 1.0, 'LinearSVR CV score vs. c (before PCA)')
Skipping PCA costs about 20 extra seconds of total runtime (84.7 s vs. 62.3 s), but the PCA-reduced data scores more than 10 percentage points lower (0.447 vs. 0.576; the score here is R², the default for regressors in cross_val_score). That loss is unacceptable, so the un-reduced data is used from here on.
(2) Gaussian (RBF) kernel
a. Kernel mapping with RBFSampler
This combines two functions: RBFSampler first maps the original data into an approximate RBF feature space, and SGDRegressor then fits the SVR on the mapped features. Per the docs, SGDRegressor with loss="epsilon_insensitive" behaves like a linear SVR; its regularization parameter alpha defaults to 0.0001, far smaller than LinearSVR's default C of 1. I was not sure at first how the two relate, but the sklearn SGD user guide suggests roughly alpha = 1 / (C * n_samples), as the sketch below illustrates. SGDRegressor is also very fast; the docs recommend it once the sample size exceeds about 10,000.
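A minimal sketch of this pairing (my own addition, not part of the original run): the kernel map and the regressor are wrapped in a Pipeline, and alpha is derived from a nominal C via the heuristic from the sklearn SGD user guide:

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

n_samples = len(cal_housing_scaled)
C = 1.0                        # nominal SVM-style penalty
alpha = 1.0 / (C * n_samples)  # heuristic: alpha = 1 / (C * n_samples)

model = make_pipeline(
    RBFSampler(gamma=10**-1.5, random_state=1),             # approximate RBF feature map
    SGDRegressor(loss="epsilon_insensitive", alpha=alpha),  # linear SVR on the mapped features
)
scores = cross_val_score(model, cal_housing_scaled.iloc[:, 0:-1],
                         cal_housing_scaled.iloc[:, -1], cv=5)
print(scores.mean())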
In [18]:
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor

start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
gamma_range = [-1, -1.5, -2, -2.5, -3]
cv_scores_RBFS = np.zeros(shape=(len(alpha_range), len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = RBFSampler(gamma=10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_RBFS[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_RBFS.shape
index = int(cv_scores_RBFS.argmax())
i, j = int(index / n), index % n  # row/column of the best score
print('Score is maximized at alpha = 10^{}, gamma = 10^{}: {}; runtime: {} s'.format(
    alpha_range[i], gamma_range[j], cv_scores_RBFS.max(), end - start))
print(cv_scores_RBFS)
Score is maximized at alpha = 10^-11, gamma = 10^-1.5: 0.5706462592558829; runtime: 100.89476466178894 s
[[ 0.12197254  0.09333182 -0.01330377 -0.08902908 -0.11553457]
 [ 0.55536832  0.5593231   0.52558767  0.46167317  0.28025331]
 [ 0.56701374  0.57042556  0.53751988  0.49345061  0.39564883]
 [ 0.5656155   0.5686655   0.53074254  0.49179496  0.39597599]
 [ 0.5672084   0.56981525  0.53735416  0.49079757  0.39544188]
 [ 0.5636954   0.57064626  0.53562283  0.49084243  0.39619231]]
b. Kernel mapping with Nystroem
In [19]:
from sklearn.kernel_approximation import Nystroem

start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
gamma_range = [-1, -1.5, -2, -2.5, -3]
cv_scores_Nyst = np.zeros(shape=(len(alpha_range), len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = Nystroem(gamma=10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_Nyst[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_Nyst.shape
index = int(cv_scores_Nyst.argmax())
i, j = int(index / n), index % n
print('Score is maximized at alpha = 10^{}, gamma = 10^{}: {}; runtime: {} s'.format(
    alpha_range[i], gamma_range[j], cv_scores_Nyst.max(), end - start))
print(cv_scores_Nyst)
Score is maximized at alpha = 10^-5, gamma = 10^-1: 0.590604308123932; runtime: 100.88983869552612 s
[[ 0.14738545  0.11121301 -0.00634036 -0.08838434 -0.11977817]
 [ 0.58485212  0.57181939  0.51956549  0.44955772  0.29771895]
 [ 0.59060431  0.5851441   0.52589562  0.48105512  0.39730037]
 [ 0.5878266   0.58427723  0.52762529  0.47229848  0.3941709 ]
 [ 0.58944038  0.58054867  0.52827467  0.47057277  0.38794852]
 [ 0.58958172  0.58522741  0.53486099  0.47879231  0.39311045]]
The docs say Nystroem is somewhat slower than RBFSampler but more accurate. Here the two runtimes came out nearly identical and Nystroem did score higher, so the results are broadly consistent with that claim.
(3) Polynomial kernel
In [20]:
start = time.time()
alpha_range = [-1, -3, -5, -7, -9, -11]
degree_range = [1, 2, 3, 4, 5, 6, 7, 8, 9]
cv_scores_poly = np.zeros(shape=(len(alpha_range), len(degree_range)))
for i in range(len(alpha_range)):
    sdgclf = SGDRegressor(loss="epsilon_insensitive", alpha=10**alpha_range[i])
    for j in range(len(degree_range)):
        # NOTE: Nystroem defaults to kernel='rbf', so this `degree` argument is
        # silently ignored; see the corrected sketch after the output below
        rbf_feature = Nystroem(degree=degree_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_poly[i, j] = scores.mean()
end = time.time()
m, n = cv_scores_poly.shape
index = int(cv_scores_poly.argmax())
i, j = int(index / n), index % n
print('Score is maximized at alpha = 10^{}, degree = {}: {}; runtime: {} s'.format(
    alpha_range[i], degree_range[j], cv_scores_poly.max(), end - start))
print(cv_scores_poly)
Score is maximized at alpha = 10^-5, degree = 3: 0.5836778542094929; runtime: 245.93922114372253 s
[[0.1387308  0.14087102 0.13907231 0.14932641 0.14683597 0.14708831 0.14242401 0.13844941 0.14060674]
 [0.56763798 0.56812605 0.56974842 0.56641043 0.56690536 0.57121851 0.56881908 0.56988619 0.56971767]
 [0.57845136 0.57844851 0.58367785 0.57892279 0.58108542 0.58091855 0.57843517 0.58109156 0.57894842]
 [0.58244182 0.57824064 0.57810287 0.58273207 0.57906801 0.5822089  0.57929546 0.57593165 0.58037378]
 [0.5791144  0.5810481  0.57533389 0.58261548 0.58022263 0.57767742 0.58082774 0.58168407 0.5816219 ]
 [0.57711384 0.57705246 0.58137879 0.57963625 0.57937285 0.58090316 0.58025604 0.5781358  0.57887552]]
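A caveat on the cell above: because Nystroem approximates an RBF kernel by default, the degree grid had no effect, which is consistent with how little the scores vary across degrees within each row. A minimal corrected sketch (my own addition; rerunning it would of course produce different numbers from the table above):

# request an actual polynomial kernel; with the default kernel='rbf',
# the `degree` argument is ignored
poly_feature = Nystroem(kernel='poly', degree=3, random_state=1)
X_features = poly_feature.fit_transform(cal_housing_scaled.iloc[:, 0:-1])
print(X_features.shape)  # n_components defaults to 100 mapped features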