Machine Learning Notes (4): Linear Regression and Logistic Regression Case Study, with an Analysis of Key Details

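This post walks through the details that matter when applying logistic regression to a real case. The explanation follows the code step by step, which makes it easier to follow.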

1. Reading the data / analyzing its characteristics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
path = os.path.join("data", "creditcard.csv")
pdData = pd.read_csv(path)
pdData.head(4)

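Structure of the data: Class indicates whether a transaction is fraudulent, and the remaining columns are the features used to decide that. The Time column is not useful here and will be dropped later.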

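The first problem shows up here (not visible in the figure): the values in Amount differ widely in magnitude (144, 50, 3).
Many learning algorithms are sensitive to feature scale, so a feature with large values can dominate the model simply because of its units. A number that only expresses quantity ends up being treated as importance. The standardization described below addresses this.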
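
As a small worked illustration of what standardization does, the sketch below applies z = (x - mean) / std to the three Amount values mentioned above. The helper name standardize is hypothetical, and the population standard deviation is used, which matches what sklearn's StandardScaler computes.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# The three Amount values mentioned in the text; their raw gaps are large.
amounts = [144.0, 50.0, 3.0]
scaled = standardize(amounts)

# The values now have mean 0 and standard deviation 1.
# Their relative order is unchanged; only the scale differs.
print([round(z, 3) for z in scaled])
```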

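Next, look at how 0 and 1 are distributed.
One function deserves a closer look here: value_counts().
Two ways of calling it are shown below. It returns how many times each distinct value occurs in a column (its frequency).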

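# value_counts() returns a Series with the number of occurrences of each distinct value in the column
class_count = pdData.Class.value_counts(sort=True)
# Another way to call it (deprecated in recent pandas releases): pd.value_counts(pdData['Class'], sort=True)
class_count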
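
To make the count versus frequency distinction concrete, here is a minimal sketch on a made-up label column (the toy Series is an assumption, not the credit card data). Passing normalize=True returns relative frequencies instead of absolute counts.

```python
import pandas as pd

# Hypothetical toy labels standing in for the Class column.
labels = pd.Series([0, 0, 0, 0, 1, 0, 1, 0], name="Class")

counts = labels.value_counts()                # absolute count per distinct value
shares = labels.value_counts(normalize=True)  # relative frequency per distinct value

print(counts)
print(shares)
```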


pandas can also plot directly: call .plot on the data object.

# Plot the counts
class_count.plot(kind='bar')
plt.show()


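This reveals the second problem: the data is extremely imbalanced. Samples of class 0 far outnumber samples of class 1. Undersampling or oversampling can be used to handle this.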

2. Data preprocessing: standardization

sklearn is a very important library. Here is a brief overview of the parts used in this post.

1. Data preprocessing

from sklearn import preprocessing
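The previous post also covered data preprocessing, in a slightly different way:
scaled_data[:, 1:3] = pp.scale(data[:, 1:3])
scale standardizes the data to zero mean and unit variance in a single call.
StandardScaler, used here, performs the same standardization but stores the mean and standard deviation it computed, so the same transformation can be applied to other data later.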

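Note the difference between fit(), fit_transform(), and transform().
fit(): computes the mean and standard deviation of the data and stores them.
fit_transform(): calls fit() to compute and store those statistics, then standardizes the same data.
transform(): standardizes data using the stored statistics. So once fit_transform() has been called on the training data, later data only needs transform().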
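
The following sketch shows the three calls side by side on made-up arrays (train and test are illustrative names, not variables from this post). The statistics are learned from the training part only and then reused on the new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, split into train and test parts.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
scaler.fit(train)                       # learns mean_ and scale_ from train only
train_scaled = scaler.transform(train)  # applies the stored statistics
test_scaled = scaler.transform(test)    # reuses the same statistics on new data

# fit_transform is fit followed by transform on the same array.
same = StandardScaler().fit_transform(train)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.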

2. Splitting the dataset

Simple split:
from sklearn.model_selection import train_test_split
train_test_split(X, y, test_size=0.3, random_state=0)
Split for cross-validation:
from sklearn.model_selection import KFold, cross_val_score

3. Built-in logistic regression model

from sklearn.linear_model import LogisticRegression

4. Confusion matrix / recall

from sklearn.metrics import confusion_matrix,recall_score,classification_report

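# Standardize the Amount column, whose values differ widely, to shrink the gaps without changing their order
from sklearn import preprocessing as pp
# .values returns all values of the selected column as an array
# reshape(rows, columns) sets a new row/column layout. Only the column count needs attention: a row count of -1 is computed automatically so the total number of elements is preserved. An explicit shape such as (3, 1) that does not match the number of elements raises an error, so just use -1.

# Standardize

pdData['normAmount'] = pp.StandardScaler().fit_transform(pdData['Amount'].values.reshape(-1,1))
# pdData['Amount'].shape   (284807,)
# pdData['Amount'].values.reshape(-1,1).shape (284807, 1)

print(pdData['Amount'].head())
print("Standardized data:")
print(pdData['normAmount'].head())

# Drop the two columns that are no longer needed
pdData = pdData.drop(['Time','Amount'],axis = 1)   # axis=1 means columns, axis=0 means rows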


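With the first problem solved, on to the second: the severe imbalance between 0 and 1.
A brief introduction to undersampling and oversampling is needed here.
Undersampling cuts down the class with more samples until it has as many as the smaller class. It robs the rich so that the rich become as poor as the poor.

Oversampling does the opposite: it uses an algorithm to expand the class with fewer samples until it has as many as the rich class. It gives to the poor.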

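The difference between loc, iloc, and ix:

pandas retrieves values in a dictionary-like way. To get a column, data[column_name] is enough. To get one or several rows, you have to locate them by index label or by position, which is what loc and iloc are for. loc looks rows up by index label, which can be a number or a string. iloc only accepts integer positions counted from 0.
ix accepted both labels and positions, but it was deprecated and then removed in pandas 1.0, so use loc or iloc instead.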
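
A minimal sketch of the label versus position distinction, using a made-up frame whose index labels deliberately differ from the row positions (the frame and its values are illustrative only):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame({"Amount": [144.0, 50.0, 3.0]}, index=[10, 20, 30])

by_label = df.loc[20, "Amount"]     # the row whose index label is 20
by_position = df.iloc[0]["Amount"]  # the first row, whatever its label is
two_rows = df.loc[[10, 30]]         # several rows selected by label

print(by_label, by_position)
print(two_rows)
```

This matters for the undersampling code later in the post, which collects index labels and therefore looks rows up with loc.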

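3. Undersampling the data

Steps:

1. Count the samples in the smaller class (A) and collect their indices (B).
2. Collect the indices of all samples in the larger class (C).
3. Randomly choose A indices from C (D).
4. Concatenate D and B, and use the combined indices to look up the data (E).
5. E is the undersampled data.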

# Separate features and labels before undersampling
X = pdData.loc[:,pdData.columns != "Class"]
y = pdData.loc[:,pdData.columns == "Class"]

# Undersample the data

number_record_fraud = len(pdData[pdData.Class == 1])  # number of class = 1 samples: 492
fraud_indices = np.array(pdData[pdData.Class == 1].index)   # indices of class = 1 samples

normal_indices = pdData[pdData.Class == 0].index # indices of class = 0 samples

# Randomly pick as many normal (class = 0) samples as there are fraud samples
random_normal_indices = np.array(np.random.choice(normal_indices,number_record_fraud,replace=False))
# Concatenate the 'random normal' and 'fraud' indices and look up the data for both

under_sample_indices = np.concatenate([random_normal_indices,fraud_indices])

#np.concatenate([np.array([1,2,3]),np.array([4,5,6])])   array([1, 2, 3, 4, 5, 6])
under_sample_data = pdData.loc[under_sample_indices]


# Features and labels after undersampling
X_undersample = under_sample_data.loc[:,under_sample_data.columns!='Class']
y_undersample = under_sample_data.loc[:,under_sample_data.columns =='Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))


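4. Cross-validation

Data is usually limited and should be used carefully.
When working with a dataset, train_test_split is used at the start to produce a large training set and a small test set, for example in a ratio of 8:2 or 7:3.

That is only the first cut. A model trained once on the 80% training data depends too much on chance; reusing the training set several times and averaging the results tends to be more stable and reliable. This is where cross-validation comes in: the 80% training set is split again into n parts and the model is trained n times, each time using n-1 parts for training and the remaining part for validation. Once the result is in, the model is evaluated on the 20% test set held out at the start, which gives a trustworthy estimate.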
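
The two-stage procedure described above can be sketched with sklearn's cross_val_score. The data below is synthetic (generated with make_classification, an assumption made only so the example is self-contained), and the model settings are illustrative rather than the ones used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the credit card set).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# First cut: hold out 20% as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second cut: 5-fold cross-validation inside the training set only.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")

print("recall per fold:", scores)
print("mean recall:", scores.mean())

# Only after the model is chosen is the held-out test set used.
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```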


# This is the cautious approach

from sklearn.model_selection import train_test_split

# Split into training set and test set
# Without random_state the split differs on every run; with a fixed value, whatever it is, the split is reproducible

# First cut: 30% test set and 70% training set
X_train, X_test, y_train, y_test = train_test_split(X, y ,test_size=0.3,random_state=0)
len(X_train)           # number of training samples: 199364


# Do the same for the undersampled data, for comparison
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample ,test_size=0.3,random_state=0)

len(X_train_undersample)  # the undersampled training set has only 688 samples


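One more point: the difference between accuracy and recall

The previous post used accuracy to evaluate the results, and that has a problem.
Suppose 10 out of 1000 people are blind. If you answer "not blind" for every one of them, your accuracy is still 99%, even though you found none of the blind people. Accuracy is therefore misleading when positive samples are rare.
This calls for another metric: recall.
recall = TP/(TP+FN)
TP: positive predicted as positive    FN: positive predicted as negative
TN: negative predicted as negative    FP: negative predicted as positive
Recall is usually shown together with a confusion matrix, covered later.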
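
To check the numbers by hand, the sketch below scores a model that always answers "negative" on 1000 people of whom 10 are positive. Here recall is computed as TP / (TP + FN), where FN counts positives that were predicted as negative; the variable names are illustrative.

```python
# 1000 people, 10 of them positive (label 1); the model always answers 0.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.99 0.0
```

Accuracy looks excellent while recall shows that not a single positive case was found.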

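Regularization penalty

The regularization penalty is another very important point.

Models sometimes overfit: they perform well on the training set and poorly on the test set.
The main cause is having too many variables (features) together with very little training data.

In short, a regularization penalty adds a term to the objective function that penalizes large weights, which keeps any single feature from influencing the predictions too strongly.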
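
Written out, the penalized logistic regression objective looks roughly as follows. This uses the lambda/2m scaling that appears in the code comments of this post; the symbols m (number of samples), n (number of weights), and h_w (the sigmoid output) are notation assumed here for illustration.

```latex
% L2-penalized objective
J_{L2}(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \log h_w(x_i) + (1-y_i)\log\big(1-h_w(x_i)\big) \Big]
            + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^{2}

% L1-penalized objective
J_{L1}(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \log h_w(x_i) + (1-y_i)\log\big(1-h_w(x_i)\big) \Big]
            + \frac{\lambda}{2m}\sum_{j=1}^{n} \lvert w_j \rvert
```

Note that sklearn's LogisticRegression is parameterized by C, the inverse of the regularization strength, so a smaller C corresponds to a larger lambda and a stronger penalty.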

# Build the model and analyze

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold,cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report

def printing_Kfold_scores(X_train_data,y_train_data):
    # KFold cross-validation: split X_train_data into 5 parts; each round trains on 4 and validates on 1
    fold = KFold(n_splits=5,shuffle=False)
    
    # Regularization penalty
    # Candidate values of C; the best one is selected below
    # Note: C is the inverse of the regularization strength, so a smaller C means a stronger penalty
    c_param_range = [0.01,0.1,1,10,100]
    
    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range
    
    # Loop over the candidate values of C
    j=0
    for c_param in c_param_range:
        print("----------------")
        print('C param:',c_param)
        print("----------------")
        recall_accs=[]  # recall of each fold
        
        # fold.split(X) yields one (training indices, validation indices) pair per fold
        for iteration,indices in enumerate(fold.split(X_train_data),start=1): # enumerate also gives a counter; start=1 makes it begin at 1
            # iteration is the fold number; indices[0] holds the training indices, indices[1] the validation indices
            
            # Build a logistic regression with the current C          # L2 penalty: lambda/2m * sum(w^2)
            lr = LogisticRegression(C=c_param,penalty = 'l1',solver='liblinear') # L1 penalty: lambda/2m * sum|w|; the liblinear solver supports L1
        
            # Fit the model on the training part, indices[0]
            # Predict on the validation part, indices[1]
            
            # Pass in the features and labels and let the model train
            # The labels must be a one-dimensional vector
            lr.fit(X_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
    
            # Predict on the validation part
            y_pred_undersample = lr.predict(X_train_data.iloc[indices[1],:].values)
            
            # Compute recall from the true and predicted values and store it
            # recall_score(true values, predicted values)
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,  y_pred_undersample)     
            recall_accs.append(recall_acc) 
            
            print('Iteration ', iteration,': recall score = ', recall_acc) 
            
           
        # The mean recall over the n folds is the recall score of this C value
        results_table.loc[ j , 'Mean recall score' ] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
     
    
    # Pick the C value with the highest mean recall (the column is cast to float so idxmax works)
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    
    return best_c

# Run on the undersampled data
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)


Plotting the confusion matrix

# Plot the confusion matrix
# Parameters: 1. cm: the matrix returned by confusion_matrix(true values, predicted values). That function already produces the result; this one only draws it.
#             2. classes: the set of label values    3. title: the figure title    4. cmap: the color map
def plot_confusion_matrix(cm,classes,title='Confusion matrix',cmap=plt.cm.Blues):
   
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()  # the gradient color bar on the right of the image
    tick_marks = np.arange(len(classes))  # one tick position per class
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 # cells above half the maximum get white text, the others black
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout() # adjusts subplot parameters automatically so the plot fills the figure area
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

import itertools
lr = LogisticRegression(C = best_c, penalty = 'l1', solver='liblinear')  # the liblinear solver supports the L1 penalty
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
# confusion_matrix returns the confusion matrix; the bottom-right cell holds the true positives
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
print(cnf_matrix)

#https://blog.****.net/nockinonheavensdoor/article/details/80328074
np.set_printoptions(precision=2)  # print with 2 digits after the decimal point

print("Recall on the test set: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))


# Plot the non-normalized confusion matrix
class_names = [0,1]   # set of label values
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()


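Setting the logistic regression threshold

The previous post used a threshold of 0.5. This value can be changed, and changing it is an effective way to tune the results.

If the bar for predicting the positive class is set too low, every positive is caught, but many negatives are also classified as positive.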
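
The trade-off can be seen on a handful of made-up probabilities (both lists below are hypothetical, chosen only for illustration). Lowering the threshold raises recall but also lets more false positives through.

```python
# Hypothetical predicted probabilities of class 1, with the true labels.
proba = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.05]
truth = [1, 1, 0, 1, 0, 1, 0, 0]

def recall_and_false_positives(threshold):
    """Classify as 1 when proba > threshold; return (recall, false positives)."""
    pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    return tp / (tp + fn), fp

for threshold in (0.1, 0.5, 0.9):
    print(threshold, recall_and_false_positives(threshold))
# 0.1 (1.0, 3)
# 0.5 (0.75, 1)
# 0.9 (0.25, 0)
```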

# Set the logistic regression threshold. By default a sigmoid output >= 0.5 counts as 1; raising the threshold makes it harder to be classified as 1

lr = LogisticRegression(C = 0.01, penalty = 'l1', solver='liblinear')  # the liblinear solver supports the L1 penalty
lr.fit(X_train_undersample,y_train_undersample.values.ravel())


# predict returns the predicted label directly, using 0.5 as the default cutoff
# predict_proba returns the probability of each label (if the probability of class 1 is 0.6, it returns 0.6 rather than 1)
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]   # sigmoid thresholds

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    
    # Probabilities above the threshold count as 1
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)  # the j-th of 9 subplots
    j+=1
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    
    

    np.set_printoptions(precision=2)  # set the print precision
    print("Recall on the test set: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold > %s'%i) 
   
plt.show()


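Oversampling

The oversampling here uses SMOTE, which is built on the k-nearest-neighbors idea: new minority-class samples are synthesized between an existing minority sample and its nearest minority neighbors.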
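
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not imbalanced-learn's actual implementation; the function name smote_like_sample and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating from a random minority
    point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        base = minority[rng.integers(len(minority))]
        # Distances from the base point to every minority point.
        dist = np.linalg.norm(minority - base, axis=1)
        # Position 0 of the sort is the base point itself, so skip it.
        neighbours = np.argsort(dist)[1:k + 1]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return np.array(synthetic)

# Hypothetical 2-D minority-class points.
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_sample(minority_points)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.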

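# Oversampling
# SMOTE: finds the minority-class samples and synthesizes new ones from them

import pandas as pd
from imblearn.over_sampling import SMOTE    # oversampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


credit_cards=pd.read_csv('data/creditcard.csv')
columns=credit_cards.columns

# The last column is 'Class', the label; removing it leaves the feature columns
features_columns=columns.delete(len(columns)-1)   # names of the feature columns

features = credit_cards[features_columns]  # the samples
labels=credit_cards['Class']  # the labels, i.e. the outcome


# Split: set aside 80% as the training set, cross-validate on the training set, and leave the test set untouched
# The results are the samples themselves, not index sets

features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.2, 
                                                                            random_state=0)
# Create the oversampler
oversampler = SMOTE(random_state=0)


# Pass in the training data to oversample it: fit_resample(training features, training labels)
# (older imbalanced-learn releases called this method fit_sample)
# The class with fewer samples is expanded automatically by the algorithm
# The results are the oversampled features and labels: data, not indices (numpy.ndarray in older imbalanced-learn versions)

os_features,  os_labels=  oversampler.fit_resample(features_train,labels_train)

len(os_labels[os_labels==1])    # class = 1 now has many more samples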


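# To find the best parameter,
# the data must be pd.DataFrame objects (convert if the sampler returned numpy.ndarray),
# because printing_Kfold_scores indexes its arguments with .iloc

os_features   =   pd.DataFrame(os_features)  
os_labels = pd.DataFrame(os_labels)   
best_c = printing_Kfold_scores(os_features,os_labels)  # cross-validate to pick the C with the best mean recall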


# With the best parameter found, evaluate on the 20% test set to see how it performs

import itertools
lr = LogisticRegression(C = best_c, penalty = 'l1', solver='liblinear')  # the liblinear solver supports the L1 penalty
lr.fit(os_features,os_labels.values.ravel())
# Predict on the 20% test set held out at the start
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
