Machine Learning: SVC in Practice + Source Reading (Support Vector Machines for Classification)

This section walks through calling scikit-learn's support vector machine classes, explains their parameters, and analyzes how each parameter affects the test results. In my opinion it is one of the fastest hands-on introductions to practical machine learning.

I. Linear SVM classification

We call LinearSVC from the sklearn package.

Its constructor defaults are:

 def __init__(self, penalty='l2', loss='squared_hinge', dual=True, tol=1e-4,
                 C=1.0, multi_class='ovr', fit_intercept=True,
                 intercept_scaling=1, class_weight=None, verbose=0,
                 random_state=None, max_iter=1000)

The parameters are as follows:

C: a float, the penalty parameter.

loss: a string specifying the loss function:

         'hinge': the hinge loss, the standard SVM loss;

         'squared_hinge': the square of the hinge loss (see the short sketch after this parameter list).

penalty: 'l1' or 'l2', the norm used in the penalty term; defaults to 'l2'.

dual: a boolean; if True, solve the dual problem; if False, solve the primal problem. When the number of samples exceeds the number of features, False is preferred.

tol: the tolerance used as the stopping criterion for the iterations.

multi_class: the multi-class strategy:

                   'ovr': one-vs-rest (the default);

                   'crammer_singer': joint multi-class optimization, rarely used.

fit_intercept: a boolean; if True, estimate the intercept (the constant term in the decision function); otherwise the intercept is ignored.

class_weight: optionally a dict mapping each class to a weight; if not given, every class gets weight 1.
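
To make the two loss options concrete, here is a minimal NumPy sketch (my own illustration, not part of LinearSVC) that evaluates both losses on a few margin values m = y * f(x):

import numpy as np

# Margins m = y * f(x): m >= 1 means correctly classified with margin,
# m < 1 means inside the margin or misclassified.
m = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

hinge = np.maximum(0.0, 1.0 - m)   # 'hinge': max(0, 1 - m)
squared_hinge = hinge ** 2         # 'squared_hinge': max(0, 1 - m)**2

print(hinge)          # [0.   0.   0.5  1.   2. ]
print(squared_hinge)  # [0.   0.   0.25 1.   4. ]

The squared variant penalizes badly misclassified points more heavily, which is why the two settings learn slightly different coefficients in the experiment below.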

Attributes:

coef_: an array of feature weights (one row per class in the multi-class case).

intercept_: an array, the constant terms of the decision function.

Methods:

fit(X, y): train the model.

predict(X): predict with the trained model; returns the predicted labels.

score(X, y): return the prediction accuracy.

1. A simple linear SVM

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed; use model_selection

# Load the iris dataset and split it into stratified training and test sets
def load_data_classification():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    return train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# Classify the data with LinearSVC
def test_LinearSVC(*data):
    X_train, X_test, y_train, y_test = data
    cls = svm.LinearSVC()
    cls.fit(X_train, y_train)
    print('Score:%.2f' % cls.score(X_test, y_test))
    print('Coefficients:%s,intercept:%s' % (cls.coef_, cls.intercept_))

X_train, X_test, y_train, y_test = load_data_classification()
test_LinearSVC(X_train, X_test, y_train, y_test)

The iris data is split into three classes; each sample has four features: sepal length and width, and petal length and width.

Score:0.97
Coefficients:[[ 0.20959539  0.3992385  -0.81739225 -0.44231995]
 [-0.12568317 -0.78049158  0.51754286 -1.0251776 ]
 [-0.80282255 -0.87632578  1.21360546  1.80985929]],intercept:[ 0.1197369   2.02403826 -1.4441551 ]

Above is the test result, along with each feature's contribution to the classification (one row of coefficients per class).
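
As a quick sanity check on the dataset description, you can inspect the bundle returned by load_iris() directly:

# Verify the dataset shape and the four feature names
iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.feature_names)   # sepal length/width and petal length/width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']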

2. How the loss function affects the classification results

def test_LinearSVC_loss(*data):
    X_train, X_test, y_train, y_test = data
    losses=['hinge','squared_hinge']
    for loss in losses:
        cls=svm.LinearSVC(loss=loss)
        cls.fit(X_train,y_train)
        print("loss:%s"%loss)
        print('Coefficients:%s,intercept %s'%(cls.coef_,cls.intercept_))
        print('Score:%.2f' % cls.score(X_test, y_test))
X_train, X_test, y_train, y_test = load_data_classification()
test_LinearSVC_loss(X_train, X_test, y_train, y_test)

Results:

loss:hinge
Coefficients:[[ 0.36636819  0.32163591 -1.07533939 -0.57004684]
 [ 0.46789769 -1.55869702  0.39897097 -1.34402625]
 [-1.21507089 -1.15278116  1.84779924  1.98442388]],intercept [ 0.18050217  1.36129524 -1.42624543]
Score:0.97
loss:squared_hinge
Coefficients:[[ 0.20959889  0.39924259 -0.8173887  -0.44231943]
 [-0.12966115 -0.78597293  0.52178887 -1.02424942]
 [-0.80322699 -0.87607737  1.21376071  1.81009443]],intercept [ 0.11973969  2.04293055 -1.44409524]
Score:0.97

3. How the penalty coefficient C affects the results. C weighs how much misclassified points matter: in the objective, C multiplies the sum of the slack terms, so the larger C is, the more heavily misclassification is penalized.

def test_LinearSVC_C(*data):
    X_train, X_test, y_train, y_test = data
    Cs = np.logspace(-2, 1)   # 50 values of C from 0.01 to 10
    train_score = []
    test_score = []
    for C in Cs:
        cls = svm.LinearSVC(C=C)
        cls.fit(X_train, y_train)
        train_score.append(cls.score(X_train, y_train))
        test_score.append(cls.score(X_test, y_test))
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(Cs, train_score, label="Training score")
    ax.plot(Cs, test_score, label="Test score")
    ax.set_xlabel(r"C")
    ax.set_ylabel(r"score")
    ax.set_xscale('log')
    ax.set_title("LinearSVC")
    ax.legend(loc='best')
    plt.show()

X_train, X_test, y_train, y_test = load_data_classification()
test_LinearSVC_C(X_train, X_test, y_train, y_test)

[Figure: training and test scores of LinearSVC as C varies (log scale)]

As the figure shows, the smaller C is, the less the misclassified points matter, so more points are misclassified and the classifier performs worse.
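
In practice, rather than reading a good C off a plot, you would usually pick it by cross-validation. A minimal sketch using GridSearchCV (my addition, not from the book):

from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over a logarithmic grid of C values
param_grid = {'C': np.logspace(-2, 1, 10)}
search = GridSearchCV(svm.LinearSVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print('Best C: %.3f, CV accuracy: %.2f'
      % (search.best_params_['C'], search.best_score_))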

II. Nonlinear classifiers

from sklearn import svm
svm.SVC()
# Constructor defaults:
def __init__(self, C=1.0, kernel='rbf', degree=3, gamma='auto',
                 coef0=0.0, shrinking=True, probability=False,
                 tol=1e-3, cache_size=200, class_weight=None,
                 verbose=False, max_iter=-1, decision_function_shape='ovr',
                 random_state=None):

A nonlinear classifier introduces a kernel function, which implicitly maps the features from the original space into a higher-dimensional space where the classes are separated.
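
For intuition about what a kernel computes, here is a small sketch (my own illustration) that evaluates the Gaussian (RBF) kernel by hand and checks it against scikit-learn's pairwise helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
print(manual)                         # 0.0820849986...
print(rbf_kernel(x, z, gamma=gamma))  # same value, returned as a 1x1 matrix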

Parameter notes:

kernel: the kernel function to use:

             'linear': linear kernel;

             'poly': polynomial kernel;

             'rbf': Gaussian (RBF) kernel, the default;

             'sigmoid': sigmoid kernel.

Attribute notes:

support_: an array of the indices of the support vectors.

support_vectors_: an array containing the support vectors themselves.
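
A quick way to look at these attributes, reusing the train/test split from section I (my addition):

cls = svm.SVC(kernel='rbf')
cls.fit(X_train, y_train)
print(cls.support_.shape)          # indices of the support vectors in X_train
print(cls.support_vectors_.shape)  # the support vectors themselves
print(cls.n_support_)              # number of support vectors per class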

Methods:

predict_log_proba(X): returns an array whose entries are the logs of the predicted probabilities of each class.

predict_proba(X): returns an array whose entries are the predicted probabilities of each class.

Note that both methods are only available when the SVC was constructed with probability=True; a short example follows.
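
A minimal example of the probability methods, again reusing the split from above; probability=True makes fitting slower because it runs an internal cross-validated calibration (Platt scaling):

# probability=True is required for predict_proba / predict_log_proba
cls = svm.SVC(kernel='rbf', probability=True)
cls.fit(X_train, y_train)
print(cls.predict_proba(X_test[:3]))  # one row of class probabilities per sample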

1. How the parameters of the polynomial kernel affect the test results. The polynomial kernel is K(x, z) = (gamma * <x, z> + coef0)^degree, so three parameters matter: degree, gamma, and coef0 (written as r below).

def test_SVC_poly(*data):
    X_train, X_test, y_train, y_test = data
    fig = plt.figure()
    ### Vary degree
    degrees = range(1, 20)
    train_score = []
    test_score = []
    for degree in degrees:
        cls = svm.SVC(kernel='poly', degree=degree)
        cls.fit(X_train, y_train)
        train_score.append(cls.score(X_train, y_train))
        test_score.append(cls.score(X_test, y_test))
    ax = fig.add_subplot(1, 3, 1)
    ax.plot(degrees, train_score, label="Training score")
    ax.plot(degrees, test_score, label="Test score")
    ax.set_xlabel("p")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("SVC_poly_degree")
    ax.legend(loc='best', framealpha=0.5)

    ### Vary gamma (degree fixed at 3)
    gammas = range(1, 20)
    train_score = []
    test_score = []
    for gamma in gammas:
        cls = svm.SVC(kernel='poly', gamma=gamma, degree=3)
        cls.fit(X_train, y_train)
        train_score.append(cls.score(X_train, y_train))
        test_score.append(cls.score(X_test, y_test))
    ax = fig.add_subplot(1, 3, 2)
    ax.plot(gammas, train_score, label="Training score")  # plot against gammas, not degrees
    ax.plot(gammas, test_score, label="Test score")
    ax.set_xlabel(r"$\gamma$")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("SVC_poly_gamma")
    ax.legend(loc='best', framealpha=0.5)

    ### Vary r (coef0), with gamma=10 and degree=3 fixed
    rs = range(0, 20)
    train_score = []
    test_score = []
    for r in rs:
        cls = svm.SVC(kernel='poly', gamma=10, degree=3, coef0=r)
        cls.fit(X_train, y_train)
        train_score.append(cls.score(X_train, y_train))
        test_score.append(cls.score(X_test, y_test))
    ax = fig.add_subplot(1, 3, 3)
    ax.plot(rs, train_score, label="Training score")
    ax.plot(rs, test_score, label="Test score")
    ax.set_xlabel(r"r")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("SVC_poly_r")
    ax.legend(loc='best')
    plt.show()

X_train, X_test, y_train, y_test = load_data_classification()
test_SVC_poly(X_train, X_test, y_train, y_test)

[Figure: training and test scores of the polynomial-kernel SVC as degree, gamma, and r vary]

2. The effect of the Gaussian kernel's gamma. The RBF kernel is K(x, z) = exp(-gamma * ||x - z||^2), so gamma controls how quickly a training point's influence falls off with distance.

def test_SVC_rbf(*data):
    X_train, X_test, y_train, y_test = data
    gammas = range(1, 20)
    train_score = []
    test_score = []
    for gamma in gammas:
        cls = svm.SVC(kernel='rbf', gamma=gamma)
        cls.fit(X_train, y_train)
        train_score.append(cls.score(X_train, y_train))
        test_score.append(cls.score(X_test, y_test))
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(gammas, train_score, label="Training score")
    ax.plot(gammas, test_score, label="Test score")
    ax.set_xlabel(r"$\gamma$")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.05)
    ax.set_title("SVC_rbf")
    ax.legend(loc='best', framealpha=0.5)
    plt.show()

X_train, X_test, y_train, y_test = load_data_classification()
test_SVC_rbf(X_train, X_test, y_train, y_test)

[Figure: training and test scores of the RBF-kernel SVC as gamma varies]

On this dataset the prediction performance stays fairly stable as gamma varies.
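
One last detail from the SVC signature quoted above: decision_function_shape='ovr'. A quick sketch of what that produces (my addition):

cls = svm.SVC(kernel='rbf', decision_function_shape='ovr')
cls.fit(X_train, y_train)
# With 'ovr', decision_function returns one score per class for each sample
print(cls.decision_function(X_test[:3]).shape)  # (3, 3) on the 3-class iris data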

Reference:

《Python大战机器学习》