Python Data Science Guide 2.3: Machine Learning with scikit-learn

scikit-learn is a general-purpose machine learning library for Python.

Example code 1:

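# To use the built-in datasets, first import the loader and generator functions from sklearn.datasets.
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_iris, load_boston, make_classification, make_circles, make_moons

# The Iris dataset, a classic classification dataset introduced by Sir Ronald Fisher.
data = load_iris()  # load_iris() returns a dictionary-like Bunch object.
x = data['data']  # The 'data' key holds the predictors x.
y = data['target']  # The 'target' key holds the response variable.
x_labels = data['feature_names']
y_labels = data['target_names']
print(x.shape)  # (150, 4): 150 instances, 4 attributes
print(y.shape)  # (150,): the response variable has 150 instances
print(x_labels)
print(y_labels)  # three classes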

# The Boston dataset, a housing dataset; this one is a regression problem.
# It likewise contains predictors and a response variable.
data = load_boston()
x = data['data']
y = data['target']
x_labels = data['feature_names']
print(x.shape)
print(y.shape)
print(x_labels)

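# Generate some classification datasets.
# make_classification produces a classification dataset: n_samples sets the number of samples,
# n_features the number of features, and n_classes the number of classes.
x, y = make_classification(n_samples=50, n_features=5, n_classes=2)

print(x.shape)
print(y.shape)
print(x[1, :])  # Print the second record in the predictor set x.
print(y[1])  # Class label of that second record.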
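Because make_classification draws its samples randomly, each run produces different values. The following is a minimal sketch, assuming reproducible output is wanted: passing random_state fixes the generated data, and np.bincount counts the samples per class. The seed value 0 and the variable names are arbitrary choices for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state=0 is an arbitrary seed chosen for illustration.
x_a, y_a = make_classification(n_samples=50, n_features=5, n_classes=2, random_state=0)
x_b, y_b = make_classification(n_samples=50, n_features=5, n_classes=2, random_state=0)

# The same seed reproduces exactly the same dataset.
same_data = np.array_equal(x_a, x_b) and np.array_equal(y_a, y_b)
print(same_data)

# Count how many samples fall into each of the two classes.
class_counts = np.bincount(y_a, minlength=2)
print(class_counts)
```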

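# Some nonlinear datasets.
# make_circles() produces concentric circles: x holds two variables (the horizontal and
# vertical coordinates) and y holds the class labels.
x, y = make_circles()
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
plt.figure(1)
plt.scatter(x[:, 0], x[:, 1], c=y)

# make_moons() produces two crescents; the plot shows that the relationship between
# the attributes in the predictor set x is nonlinear.
x, y = make_moons()
plt.figure(2)
plt.scatter(x[:, 0], x[:, 1], c=y)

plt.show()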

Output:

Figure 1
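The two rings from make_circles cannot be separated by a straight line in the original coordinates, but a derived feature can tell them apart. Here is a rough sketch, assuming the default settings (no noise, factor=0.8): the distance of each point from the origin differs between the two classes. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

x_c, y_c = make_circles()  # defaults: 100 points, no noise, factor=0.8

# Distance of every point from the origin.
radius = np.hypot(x_c[:, 0], x_c[:, 1])

# Without noise, class 0 lies on the outer circle and class 1 on the inner one.
outer_radius = radius[y_c == 0]
inner_radius = radius[y_c == 1]
print(outer_radius.min(), outer_radius.max())
print(inner_radius.min(), inner_radius.max())
```

A single threshold on this radius would split the classes, which is the kind of nonlinear relationship the scatter plot hints at.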

Example code 2: learning how to call scikit-learn's machine learning functions

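All classes that implement machine learning methods in scikit-learn derive from BaseEstimator. By convention every estimator provides a fit method; transformers such as PolynomialFeatures additionally provide transform, while predictors such as DecisionTreeClassifier provide predict.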
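To make the fit and transform convention concrete, here is a minimal sketch of a custom transformer. The class name MeanCenterer is hypothetical and not part of scikit-learn; the sketch only assumes the BaseEstimator and TransformerMixin base classes from sklearn.base.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts each column's mean."""

    def fit(self, X, y=None):
        # Learn the per-column means from the training data.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows chained calls

    def transform(self, X):
        # Apply the means learned in fit to any input.
        return np.asarray(X, dtype=float) - self.mean_


x_small = np.array([[1, 2], [3, 6]])
centerer = MeanCenterer()
x_centered = centerer.fit(x_small).transform(x_small)
print(x_centered)

# TransformerMixin supplies fit_transform on top of fit and transform.
x_centered_again = MeanCenterer().fit_transform(x_small)
```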

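import numpy as np
# Use the PolynomialFeatures class to show how convenient scikit-learn's API is.
# Sometimes we want to add new variables to the predictor set to see whether model accuracy improves.
# PolynomialFeatures derives new polynomial features from the existing ones.
from sklearn.preprocessing import PolynomialFeatures

# Data preprocessing
# Create a dataset with two instances and two attributes.
# A plain ndarray is used because recent scikit-learn releases reject np.matrix input.
x = np.array([[1, 2], [2, 4]])
# Instantiate PolynomialFeatures with the desired polynomial degree.
poly = PolynomialFeatures(degree=2)
# fit performs the computations needed before the data can be transformed.
poly.fit(x)
# transform takes the input data and converts it based on what fit computed.
x_poly = poly.transform(x)

print("Original x variable shape", x.shape)
print(x)
print("Transformed x variable", x_poly.shape)
print(x_poly)

# Alternatively, do both steps in a single call.
x_poly = poly.fit_transform(x)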
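To see which terms the degree-2 expansion generates, the fitted object's powers_ attribute can be inspected. This is a small sketch, assuming a plain NumPy array as input; the names x_demo and poly_demo are illustrative. Each row of powers_ gives the exponents applied to the two input columns, so for an input row (a, b) the output columns are 1, a, b, a^2, a*b, b^2.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[1, 2], [2, 4]])
poly_demo = PolynomialFeatures(degree=2)
x_demo_poly = poly_demo.fit_transform(x_demo)

# Each row lists the exponents of (a, b) used for one output column.
print(poly_demo.powers_)
# Columns are 1, a, b, a^2, a*b, b^2 for each input row (a, b).
print(x_demo_poly)
```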

# Import DecisionTreeClassifier from the tree module; it implements the decision tree algorithm.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

# Instantiate a DecisionTreeClassifier object.
estimator = DecisionTreeClassifier()
# Call fit with the predictors x and the response variable y to build the model.
estimator.fit(x, y)
# Use predict to predict the class label for the given input.
predicted_y = estimator.predict(x)
# The next two calls return the predicted class probabilities and their logarithms, respectively.
predicted_y_prob = estimator.predict_proba(x)
predicted_y_lprob = estimator.predict_log_proba(x)
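The calls above predict on the same rows the tree was trained on, which tends to overstate how well the model generalizes. A hedged sketch of a held-out evaluation follows; the 70/30 split, random_state=0, and the variable names are arbitrary choices for illustration and do not come from the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Hold out 30% of the rows; the split ratio and seed are illustrative assumptions.
x_train, x_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_train, y_train)

# Evaluate on rows the tree has not seen during fit.
test_predictions = tree.predict(x_test)
test_accuracy = accuracy_score(y_test, test_predictions)
test_probabilities = tree.predict_proba(x_test)  # one column per class
print(test_accuracy)
print(test_probabilities[:3])
```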


# Pipeline: with this feature, different machine learning steps can be chained together.
from sklearn.pipeline import Pipeline
poly = PolynomialFeatures(degree=3)
tree_estimator = DecisionTreeClassifier()
# Define a list of tuples describing the chain: the polynomial feature generator runs first, then the decision tree.
steps = [('poly', poly), ('tree', tree_estimator)]
# Instantiate the Pipeline object with the list declared in steps.
estimator = Pipeline(steps=steps)
estimator.fit(x, y)
predicted_y = estimator.predict(x)
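What Pipeline.fit does here can be sketched by hand: transform the data with the first step, then fit the final step on the result. The sketch below checks that both routes give the same predictions. The fixed random_state and the variable names are illustrative assumptions added so the two trees are built identically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_iris, y_iris = iris['data'], iris['target']

# Route 1: let the pipeline chain the two steps.
pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(degree=3)),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
pipe.fit(x_iris, y_iris)
pipe_predictions = pipe.predict(x_iris)

# Route 2: run the same steps manually.
manual_poly = PolynomialFeatures(degree=3)
manual_tree = DecisionTreeClassifier(random_state=0)
x_expanded = manual_poly.fit_transform(x_iris)
manual_tree.fit(x_expanded, y_iris)
manual_predictions = manual_tree.predict(manual_poly.transform(x_iris))

print(np.array_equal(pipe_predictions, manual_predictions))
# Individual steps remain accessible by the names given in the steps list.
print(pipe.named_steps['tree'])
```

The pipeline form is shorter and keeps the transformation and the model together, so the same preprocessing is applied automatically at prediction time.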
