Python for Beginners, Machine Learning Series: The Naive Bayes Algorithm
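I. General idea:
1. Collect the data set and build the vocabulary: the set of all distinct words that appear across the documents.
2. Convert each document into a 0/1 vector over that vocabulary. A position is 1 if the word appears in the document and 0 otherwise, so every document becomes a vector of the same length.
3. Compute three probabilities: first, the probability that a document is abusive; second, the probability of each word within abusive documents; third, the probability of each word within non-abusive documents.
4. To compute the second and third: loop over the 0/1 document vectors, add up the vectors that belong to the same class, and divide by the total word count of that class (the sum of all entries in those vectors). The result is a vector holding the probability of each word; a word that never appears in that class has a raw count of 0.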
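The four steps above can be sketched on a tiny corpus. This is only an illustration in plain Python: the three documents, their labels, and the names `docs`, `labels`, `vocab`, `vectors` and `word_probs` are made up for the example and are not part of the code in section II. No smoothing is applied here, so unseen words come out as exactly 0.

```python
# Toy corpus (made up for illustration): label 1 = abusive, 0 = not abusive
docs = [["you", "are", "stupid"], ["nice", "dog"], ["stupid", "dog"]]
labels = [1, 0, 1]

# Step 1: vocabulary of distinct words (sorted so the column order is stable)
vocab = sorted({w for d in docs for w in d})

# Step 2: one 0/1 vector per document, each as long as the vocabulary
vectors = [[1 if w in d else 0 for w in vocab] for d in docs]

# Step 3, first probability: share of abusive documents
p_abusive = sum(labels) / len(labels)


def word_probs(cls):
    """Step 4: add up the vectors of one class, divide by that class's total count of 1s."""
    rows = [v for v, y in zip(vectors, labels) if y == cls]
    counts = [sum(col) for col in zip(*rows)]
    total = sum(counts)
    return [c / total for c in counts]


p_word_abusive = word_probs(1)  # second probability
p_word_normal = word_probs(0)   # third probability

print(vocab)
print(p_word_abusive)
print(p_word_normal)
```

In this toy corpus "stupid" appears in both abusive documents, so it receives the largest share of the abusive-class probability, while "nice" gets 0 there because it never occurs in an abusive document.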
II. Code implementation:
import math

import numpy as np


def loadDataSet():
    # Six toy posts; classVec labels each one: 1 = abusive, 0 = not abusive
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec


# Build the vocabulary from the data set
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union keeps each word once
    return list(vocabSet)  # return as a list


# Convert one document into a 0/1 vector over the vocabulary
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # present is 1, absent stays 0
        else:
            print("The word %s is not in Vocabulary" % word)
    return returnVec


# Compute the three probabilities
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Counts start at 1 and denominators at 2 (Laplace-style smoothing),
    # so a word unseen in one class does not get probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Store log-probabilities: classifyNB adds them to the log prior
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


# Compare the two class scores (log-probabilities) and pick the larger
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    vec = np.array(vec2Classify)
    p1 = np.sum(vec * p1Vec) + math.log(pClass1)
    p0 = np.sum(vec * p0Vec) + math.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


# Final all-in-one test function; ceshi is the document to classify (a list of words)
def testingNB(ceshi):
    postingList, classVec = loadDataSet()
    new_list = createVocabList(postingList)
    train = []
    for i in postingList:
        train.append(setOfWords2Vec(new_list, i))
    p0Vect, p1Vect, pAbusive = trainNB0(train, classVec)
    ce = setOfWords2Vec(new_list, ceshi)
    print(classifyNB(ce, p0Vect, p1Vect, pAbusive))
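classifyNB adds math.log(pClass1) to a sum over the words, which means the class scores are compared as log-probabilities instead of as a product of raw probabilities. The following minimal sketch shows why that form is usually preferred. The probability values and the names `word_probs`, `product` and `log_sum` are hypothetical and do not come from the data set above.

```python
import math

# Hypothetical per-word probabilities for a long document
word_probs = [0.01] * 200

# Multiplying many small probabilities underflows to 0.0 in floating point
product = 1.0
for p in word_probs:
    product *= p

# Summing the logarithms carries the same information in a representable range
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # about -921.03
```

Because the logarithm is monotonic, comparing the summed logs of two classes gives the same decision as comparing the products would, without the underflow.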
Test results: