Python Large-Scale Machine Learning, Day 6: Describing the Target

Describing the target

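Experiment requirements:
1. Learn how to count class labels and compute their frequencies.
Experiment content:
1. Study the forest cover type values in the dataset and treat them as a categorical variable.
Annotated code: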
import os, csv

local_path = os.getcwd()
source = 'covtype.data'
SEP = ','
# Forest cover types: the last number in each row of the dataset is the cover type code
forest_type = {1: "Spruce/Fir", 2: "Lodgepole Pine", 3: "Ponderosa Pine",
               4: "Cottonwood/Willow", 5: "Aspen", 6: "Douglas-fir", 7: "Krummholz"}
# Initialise the count of every cover type to 0
forest_type_count = {value: 0 for value in forest_type.values()}
forest_type_count['Other'] = 0
lodgepole_pine = 0
spruce = 0
proportions = list()  # running proportions, collected for the plot
with open(os.path.join(local_path, source), 'rt') as R:
    iterator = csv.reader(R, delimiter=SEP)
    for n, row in enumerate(iterator):
        response = int(row[-1])  # The response is the last value
        try:
            # Running counter that tracks how many rows each cover type has
            forest_type_count[forest_type[response]] += 1
            if response == 1:
                spruce += 1
            elif response == 2:
                lodgepole_pine += 1
            if n % 10000 == 0:
                # Record the running proportions once every 10,000 rows (an interval
                # chosen after inspecting the output by eye) to make plotting easier
                proportions.append([spruce/float(n+1), lodgepole_pine/float(n+1)])
        except KeyError:
            # Any other code goes into the Other class; in this dataset its count stays 0
            forest_type_count['Other'] += 1
print('Total rows:%i' % (n+1))
print('Frequency of classes:')  # frequency of each class
# sorted() orders the (type, count) pairs; reverse=True sorts from largest to smallest;
# items() iterates over the key-value pairs of the dictionary
for ftype, freq in sorted([(t, v) for t, v in forest_type_count.items()], key=lambda x: x[1], reverse=True):
    print("%-18s:%6i %04.1f%%" % (ftype, freq, freq*100/float(n+1)))
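The same tally can also be sketched with `collections.Counter`. This is only a minimal illustration, not the code used in this experiment: the `count_classes` helper and the five-row in-memory sample are hypothetical stand-ins for covtype.data, whose real rows have many more feature columns before the label.

```python
import csv
import io
from collections import Counter

FOREST_TYPE = {1: "Spruce/Fir", 2: "Lodgepole Pine", 3: "Ponderosa Pine",
               4: "Cottonwood/Willow", 5: "Aspen", 6: "Douglas-fir", 7: "Krummholz"}


def count_classes(lines, sep=","):
    """Count cover types from an iterable of CSV lines (last field is the class code)."""
    counts = Counter()
    total = 0
    for row in csv.reader(lines, delimiter=sep):
        if not row:  # skip blank lines
            continue
        # Unknown codes fall back to the "Other" class
        label = FOREST_TYPE.get(int(row[-1]), "Other")
        counts[label] += 1
        total += 1
    return counts, total


# Hypothetical sample: two dummy feature columns followed by the class code
sample = io.StringIO("10,20,1\n11,21,2\n12,22,2\n13,23,7\n14,24,9\n")
counts, total = count_classes(sample)
# most_common() returns (label, count) pairs sorted from largest to smallest count
for ftype, freq in counts.most_common():
    print("%-18s:%6i %04.1f%%" % (ftype, freq, freq * 100 / total))
```

To run it on the real file, an open file object could be passed in place of `sample`, since `csv.reader` accepts any iterable of lines.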

Code 2:
import matplotlib.pyplot as plt
import numpy as np

proportions = np.array(proportions)  # convert the list into a two-dimensional array
plt.figure()
plt.plot(proportions[:, 0], 'r-', label='Spruce/Fir')
plt.plot(proportions[:, 1], 'b-', label='Lodgepole Pine')
plt.ylim(0.0, 0.8)
plt.xlabel('Training examples (unit=10000)')
plt.ylabel('%')
plt.legend(loc='lower right', numpoints=1)
plt.show()
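The values being plotted are running shares: at every checkpoint, the number of rows of a class seen so far divided by the total number of rows seen so far. A small sketch of that bookkeeping, separated from the file reading, is given below. The `running_proportions` generator, the toy label list, and the step of 2 are all hypothetical, chosen only so the numbers are easy to check by hand.

```python
def running_proportions(labels, targets, step):
    """Yield the running share of each target class once every `step` rows."""
    hits = {t: 0 for t in targets}
    for n, label in enumerate(labels):
        if label in hits:
            hits[label] += 1
        if n % step == 0:
            # n is zero-based, so n + 1 rows have been seen at this point
            yield [hits[t] / (n + 1) for t in targets]


# Toy stream of class codes: 1 = Spruce/Fir, 2 = Lodgepole Pine
labels = [1, 2, 2, 1, 2, 3, 2, 1]
checkpoints = list(running_proportions(labels, targets=(1, 2), step=2))
for spruce_share, pine_share in checkpoints:
    print("%.3f %.3f" % (spruce_share, pine_share))
```

Because it is a generator, the labels can be streamed one at a time, which matches the row-by-row reading used in this experiment.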

Run screenshots:
(The two screenshots are not reproduced here.)

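Experiment summary:
1. Pay attention to changes between Python versions; code from older books may need updating.
2. Python is only a tool; you still need to inspect the results with your own eyes.
3. Learn to handle different types of data. The previous experiment covered computing the mean and standard deviation; this one covered classifying the target and computing its counts and frequencies.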
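As an illustration of the first point, here are two typical differences between Python 2 and Python 3 that affect code like this experiment's. The source does not say which lines actually had to be updated, so the snippet is only an assumed example, with made-up counts.

```python
# Made-up counts for illustration: 3 Spruce/Fir rows among the first 10 rows
spruce, n = 3, 9

# Python 3: "/" is always true division, so wrapping the divisor in float()
# (needed in Python 2 to avoid integer division) is harmless but optional.
share = spruce / (n + 1)
assert share == spruce / float(n + 1)

# Python 3: print is a function and requires parentheses.
print('Total rows:%i' % (n + 1))
print('Spruce/Fir share: %04.1f%%' % (share * 100))
```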
