Bugs in the k-nearest neighbors algorithm!

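I spent three days on the k-nearest neighbors algorithm and fell into a humbling number of pits. For starters, the syntax differences between Python 2 and Python 3 nearly finished me off.

I'm writing this blog post as a memento for myself.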

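1. How the kNN algorithm works

The official explanation: we have a sample data set, also called the training set, in which every item carries a label, i.e. we know which class each item in the sample set belongs to. When new, unlabeled data comes in, each of its features is compared with the corresponding features of the items in the sample set, and the algorithm extracts the class labels of the most similar items (the nearest neighbors). In general we only look at the k most similar items in the sample set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k most similar items is assigned to the new data.

My understanding: it's really just classification. Find your nearest neighbors, see which class most of them belong to, and take that as your own class.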
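To make the idea concrete, here is a minimal pure-Python sketch of "look at the k closest samples and take a majority vote". The function name knn_classify is my own for illustration and is not from the book; the four toy points are the same ones used in the test example later in this post.

```python
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        sum((q - s) ** 2 for q, s in zip(query, sample))
        for sample in samples
    ]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Count the labels of those neighbors and pick the most common one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

samples = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]
print(knn_classify([0.0, 0.0], samples, labels, k=3))  # prints B
```

With k=3 the three nearest points to [0, 0] carry the labels B, B and A, so the vote goes to B.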

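I'm getting off topic. Here come the pitfalls.

1. Before importing any data with Python, first create a Python module named kNN.py. A module is really just a text file: create a .txt file, write the code in it, and rename it to .py. (At first I honestly had no idea that this is where modules come from.) Note that everything in this post, including all the code, goes into this one file.

2. To make the following steps easier, the Python environment needs the NumPy library.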

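1) Download the numpy package

Download link: https://pypi.python.org/pypi/numpy/#downloads

I'm on Python 3.6 with a 64-bit operating system, so I chose numpy-1.11.2+mkl-cp36-cp36m-win-amd64.whl

2) Install numpy

Copy the downloaded package into the Scripts folder of the Python installation directory, D:\Python\Scripts

Run pip install "numpy-1.11.2+mkl-cp36-cp36m-win-amd64.whl"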


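The output looked like it wanted me to upgrade pip, so, following the prompt, I went to the Python installation directory

and entered python -m pip install --upgrade pip

After a short wait, the upgrade succeeded

Then I ran pip install "numpy-1.11.2+mkl-cp36-cp36m-win-amd64.whl" again

After another short wait, the installation succeeded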


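3. Install Matplotlib to draw scatter plots of the raw data

I recommend installing it online, which is the simplest way

Open a command window with cmd

Run python -m pip install -U pip setuptools to upgrade. Then enter python -m pip install matplotlib to install automatically; the package is downloaded for you. Once it finishes, you can run python -m pip list to see every module installed on the machine and confirm that matplotlib was installed successfully.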
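Besides pip list, the same check can be done from inside Python. The following is a small sketch using the standard library's importlib.util.find_spec; the helper name is_installed is made up for this post.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Report the two libraries this post depends on.
for name in ("numpy", "matplotlib"):
    print(name, "installed" if is_installed(name) else "MISSING")
```

If either line says MISSING, rerun the matching pip install command above with the same Python interpreter you use to run kNN.py.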

4. Write the algorithm.

First write a small test example to check that the Python setup works

from numpy import *
import operator
import importlib

def createDataSet():
    # Four 2-D sample points and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

Run it in IDLE


It worked.

Now on to the actual algorithm

The k-nearest neighbors algorithm:

Listing 1-1

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Row indices ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest rows
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

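One thing to watch out for: in the improved dating-site example from the kNN chapter of Machine Learning in Action, the line sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True) does not run under Python 3, because dictionaries no longer have iteritems(). It has to be changed to classCount.items(), otherwise you get an error!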
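Here is a tiny standalone sketch of that vote-counting step under Python 3. The dictionary contents are made-up sample votes, and the max(...) line is an alternative I'm adding for comparison, not something from the book.

```python
import operator

# Made-up vote counts: label -> number of neighbors with that label.
class_count = {"A": 1, "B": 2}

# Python 3: items() returns the (label, count) pairs; iteritems() is gone.
sorted_class_count = sorted(
    class_count.items(), key=operator.itemgetter(1), reverse=True
)
print(sorted_class_count[0][0])  # prints B

# If only the winning label is needed, max() avoids sorting everything.
print(max(class_count, key=class_count.get))  # prints B
```

Both lines pick the label with the highest count; the sorted version matches the listing above, while max() is just shorter when the full ranking is not needed.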

Listing 1-2:

def file2matrix(filename):
    # Read all lines once; the with block closes the file afterwards
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)
    returnMat = zeros((numberOfLines, 3))  # one row per sample, three features
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]  # first three columns are the features
        classLabelVector.append(int(listFromLine[-1]))  # last column is the class label
    return returnMat, classLabelVector

A point to note here: the book writes >>> reload(kNN), but that does not work in Python 3. You have to write import importlib and then importlib.reload(kNN).
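To see what importlib.reload actually does, here is a self-contained sketch that writes a throwaway module to a temporary folder, imports it, edits the file, and reloads it. The module name demo_knn and its VALUE variable are hypothetical stand-ins for kNN.py.

```python
import importlib
import pathlib
import sys
import tempfile

# Always compile from source, so a cached .pyc written within the same
# second as the edit cannot be reused by mistake.
sys.dont_write_bytecode = True

# Create a throwaway module (hypothetical name) in a temporary folder.
tmp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(tmp_dir) / "demo_knn.py"
module_path.write_text("VALUE = 1\n")
sys.path.insert(0, tmp_dir)

import demo_knn
print(demo_knn.VALUE)  # value from the first version of the file

# Edit the file on disk, as you would after changing kNN.py in an editor.
module_path.write_text("VALUE = 2\n")

# A second plain import would reuse the cached module; reload re-executes it.
importlib.reload(demo_knn)
print(demo_knn.VALUE)  # value from the edited file
```

The same pattern applies in IDLE: after saving changes to kNN.py, call importlib.reload(kNN) instead of restarting the shell.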

Then test it:


Analyzing the data with Matplotlib:

>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
>>> plt.show()


Listing 1-3 (normalizing feature values)

def autoNorm(dataSet):
    # Column-wise minimum, maximum and range of each feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Scale every feature to [0, 1]: (value - min) / (max - min)
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
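As a side note, NumPy broadcasting can do the same min-max scaling without tile. This is a sketch of an alternative, not the book's code; the name auto_norm_broadcast and the sample array are made up for illustration, and it assumes no feature column is constant (a zero range would divide by zero).

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Scale each column to [0, 1] using (value - min) / (max - min)."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting stretches the 1-D min_vals and ranges across every row.
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, ranges, min_vals

# Made-up data: two features on very different scales.
sample = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
norm, ranges, min_vals = auto_norm_broadcast(sample)
print(norm)
```

The return values have the same order as autoNorm above, so a new input can be scaled the same way with (inArr - minVals) / ranges.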

Listing 1-4: test code for the dating site classifier

def datingClassTest():
    hoRatio = 0.10  # hold out the first 10% of the rows as test data
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test row i against the remaining 90% of the rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))


Listing 1-5:

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the input with the same minVals and ranges as the training data
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    # Labels are 1-based, resultList is 0-based
    print("you will probably like this person: ", resultList[classifierResult - 1])


That's the end.

