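Large-Scale Machine Learning with Python, Day 6: Describing the Target
Describing the Target
Lab requirements:
1. Learn how to count the classes of a categorical variable and compute their frequencies.
Lab content:
1. Examine the forest cover type values and treat them as a categorical target.
Annotated code: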
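The counting and frequency calculation named above can be sketched on a toy example first. This is a minimal illustration, not the lab's own code: the `labels` list is hypothetical and stands in for the last column of the data file, and `collections.Counter` is used here instead of a hand-built dictionary.

```python
from collections import Counter

# Hypothetical toy labels standing in for the last column of the data file
labels = [2, 1, 2, 2, 3, 1, 2, 7]

counts = Counter(labels)        # class code -> number of occurrences
total = sum(counts.values())    # total number of examples

# Relative frequency of each class, as a fraction of all examples
frequencies = {code: count / total for code, count in counts.items()}

# most_common() lists the classes from most to least frequent
for code, count in counts.most_common():
    print("class %d: %d rows, %.1f%%" % (code, count, 100 * frequencies[code]))
```

The lab code does the same thing one row at a time, so the whole file never has to be held in memory.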
import os, csv
local_path = os.getcwd()
source = 'covtype.data'
SEP = ','
# Forest cover types: the last value in each row is the cover type code; the values before it are the input features
forest_type = {1: "Spruce/Fir", 2: "Lodgepole Pine", 3: "Ponderosa Pine", 4: "Cottonwood/Willow", 5: "Aspen", 6: "Douglas-fir", 7: "Krummholz"}
forest_type_count = {value: 0 for value in forest_type.values()}  # start every class count at 0
forest_type_count['Other'] = 0
lodgepole_pine = 0
spruce = 0
proportions = list()  # will hold the running proportions used for plotting
with open(os.path.join(local_path, source), 'rt') as R:
    iterator = csv.reader(R, delimiter=SEP)
    for n, row in enumerate(iterator):
        response = int(row[-1])  # the response is the last value in each row
        try:
            forest_type_count[forest_type[response]] += 1  # accumulate the count for this class
            if response == 1:
                spruce += 1
            elif response == 2:
                lodgepole_pine += 1
            if n % 10000 == 0:
                # record the running proportions every 10,000 rows (an interval chosen after inspecting the output by eye) to make plotting easier
                proportions.append([spruce/float(n+1), lodgepole_pine/float(n+1)])
        except KeyError:
            forest_type_count['Other'] += 1  # any other code is counted as 'Other'; in this dataset that count stays at 0
    print('Total rows:%i' % (n+1))
    print('Frequency of classes:')  # how often each class occurs
    # sorted() with reverse=True orders the classes from largest to smallest count; items() yields the (key, value) pairs
    for ftype, freq in sorted(forest_type_count.items(), key=lambda x: x[1], reverse=True):
        print("%-18s:%6i %04.1f%%" % (ftype, freq, freq*100/float(n+1)))
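To make the streaming logic easier to try out, the loop can be wrapped in a function and fed a few in-memory rows. The following is only a sketch under stated assumptions: `stream_class_counts`, its parameters, and the four-row sample are hypothetical names and data introduced for illustration. It uses `dict.get` in place of the try/except above, so unlike the original it also takes a snapshot on rows with an unknown class.

```python
import csv
import io

def stream_class_counts(lines, names, snapshot_every=2, tracked=1):
    """Count classes row by row and snapshot the running share of one class."""
    counts = {name: 0 for name in names.values()}
    counts['Other'] = 0
    seen_tracked = 0      # rows seen so far whose class equals `tracked`
    snapshots = []        # running share of `tracked`, taken periodically
    n_rows = 0
    for n, row in enumerate(csv.reader(lines)):
        response = int(row[-1])                    # class code is the last field
        counts[names.get(response, 'Other')] += 1  # unknown codes go to 'Other'
        if response == tracked:
            seen_tracked += 1
        if n % snapshot_every == 0:
            snapshots.append(seen_tracked / (n + 1))
        n_rows = n + 1
    return counts, snapshots, n_rows

# Hypothetical four-row sample; rows in the real file have many more columns
sample = io.StringIO("10,20,1\n11,21,2\n12,22,1\n13,23,9\n")
names = {1: "Spruce/Fir", 2: "Lodgepole Pine"}
counts, snapshots, n_rows = stream_class_counts(sample, names)
print(counts, snapshots, n_rows)
```

Each snapshot divides the count so far by the number of rows read so far, which is why the plotted curves settle as more rows arrive.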
Code 2:
import matplotlib.pyplot as plt
import numpy as np
proportions = np.array(proportions)  # convert the list into a 2-D array
plt.figure()
plt.plot(proportions[:,0], 'r-', label='Spruce/Fir')
plt.plot(proportions[:,1], 'b-', label='Lodgepole Pine')
plt.ylim(0.0, 0.8)
plt.xlabel('Training examples (unit=10000)')
plt.ylabel('%')
plt.legend(loc='lower right', numpoints=1)
plt.show()
Screenshot of the run:
Source code:
import os, csv
local_path = os.getcwd()
source = 'covtype.data'
SEP = ','
forest_type = {1: "Spruce/Fir", 2: "Lodgepole Pine", 3: "Ponderosa Pine", 4: "Cottonwood/Willow", 5: "Aspen", 6: "Douglas-fir", 7: "Krummholz"}
forest_type_count = {value: 0 for value in forest_type.values()}
forest_type_count['Other'] = 0
lodgepole_pine = 0
spruce = 0
proportions = list()
with open(os.path.join(local_path, source), 'rt') as R:
    iterator = csv.reader(R, delimiter=SEP)
    for n, row in enumerate(iterator):
        response = int(row[-1])  # The response is the last value
        try:
            forest_type_count[forest_type[response]] += 1
            if response == 1:
                spruce += 1
            elif response == 2:
                lodgepole_pine += 1
            if n % 10000 == 0:
                proportions.append([spruce/float(n+1), lodgepole_pine/float(n+1)])
        except KeyError:
            forest_type_count['Other'] += 1
    print('Total rows:%i' % (n+1))
    print('Frequency of classes:')
    for ftype, freq in sorted(forest_type_count.items(), key=lambda x: x[1], reverse=True):
        print("%-18s:%6i %04.1f%%" % (ftype, freq, freq*100/float(n+1)))
Code 2:
import matplotlib.pyplot as plt
import numpy as np
proportions = np.array(proportions)
plt.figure()
plt.plot(proportions[:,0], 'r-', label='Spruce/Fir')
plt.plot(proportions[:,1], 'b-', label='Lodgepole Pine')
plt.ylim(0.0, 0.8)
plt.xlabel('Training examples (unit=10000)')
plt.ylabel('%')
plt.legend(loc='lower right', numpoints=1)
plt.show()
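Lab summary:
1. Pay attention to changes between Python versions; code from older books may need updating.
2. Python is only a tool; you still need to inspect the results with your own eyes.
3. Learn to handle different types of data: the previous lab covered computing the mean and standard deviation, and this lab covered grouping values into classes and computing their counts and frequencies.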
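As an illustration of point 1, here are the kinds of changes that older listings commonly need. This is a sketch that assumes the book's code was written for Python 2; the file name `sample.data` and its three rows are hypothetical and exist only to make the example self-contained.

```python
import csv
import os
import tempfile

# Write a tiny hypothetical CSV so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'sample.data')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([[1, 2, 1], [3, 4, 2], [5, 6, 2]])

# Python 3: csv.reader expects a text-mode file (Python 2 code often used 'rb');
# the csv module documentation recommends passing newline=''
with open(path, 'rt', newline='') as f:
    rows = list(csv.reader(f))

# Count the last field of each row (fields are read as strings)
counts = {}
for row in rows:
    counts[row[-1]] = counts.get(row[-1], 0) + 1

# Python 3: dict.items() replaces Python 2's dict.iteritems()
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Python 3: / is true division, so wrapping the divisor in float() is optional
share = ranked[0][1] / len(rows)
print(ranked, share)
```

Building the path with `os.path.join` also avoids hard-coding a backslash separator, which only works on Windows.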