Python for Beginners, Machine Learning: the Naive Bayes Algorithm

I. General approach:

1. Build the vocabulary: collect every word that appears in any document into a single set, with no duplicates.

2. Convert each document into a 0/1 vector over that vocabulary: a position is 1 if the corresponding word appears in the document, so every document becomes a vector of the same length (see the short sketch after this list).

3. Compute three probabilities: the probability that a document is abusive, the probability of each word given the abusive class, and the probability of each word given the non-abusive class.

4. For the second and third of these, iterate over the 0/1 vectors: within each class, sum the vectors element-wise and divide by the total number of words seen in that class. The result is a vector of per-word probabilities; a word that never appears in a class would get probability 0, which is why the code below adds Laplace smoothing and works with log probabilities.
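Here is a tiny sketch of steps 1 and 2 on made-up data (not the dataset used below), just to make the 0/1 representation concrete:

# toy illustration of the vocabulary and the 0/1 (set-of-words) vector
docs = [['my', 'dog', 'is', 'cute'], ['stupid', 'dog']]
vocab = sorted(set(w for d in docs for w in d))   # ['cute', 'dog', 'is', 'my', 'stupid']
vec = [1 if w in docs[1] else 0 for w in vocab]   # [0, 1, 0, 0, 1]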

II. Code implementation:

 

import numpy as np
import math
def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                 ['maybe','not','take','him','to','dog','park','stupid'],
                 ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
    classVec=[0,1,0,1,0,1]
    return  postingList,classVec
# loadDataSet returns the sample posts and their class labels (1 = abusive, 0 = not abusive)
def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet | set(document)
    return list(vocabSet)   #返回列表
# createVocabList builds the vocabulary: a list of every unique word across the documents
def setOfWords2Vec(vocabList,inputSet):
    returnVec  = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec
# setOfWords2Vec marks 1 if a word appears and 0 if it does not, giving a 0/1 vector as long as the vocabulary
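# Quick check of the two helpers above (my own example; uncomment to try).
# The word order in the vocabulary comes from a set, so the exact 0/1 pattern
# varies between runs:
# posts, labels = loadDataSet()
# vocab = createVocabList(posts)
# print(len(vocab))                       # number of unique words in the corpus
# print(setOfWords2Vec(vocab, posts[0]))  # 0/1 vector for the first post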
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # prior probability of the abusive class
    # Laplace smoothing: start every word count at 1 and each denominator at 2,
    # so a word never seen in a class does not get probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs so classifyNB can add scores instead of multiplying tiny numbers
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
# trainNB0 estimates the abusive-class prior and the per-word log probabilities for each class
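# Quick check of trainNB0 (my own example; uncomment to try):
# posts, labels = loadDataSet()
# vocab = createVocabList(posts)
# trainMat = [setOfWords2Vec(vocab, p) for p in posts]
# p0V, p1V, pAb = trainNB0(trainMat, labels)
# print(pAb)   # 0.5, since three of the six sample posts are labelled abusive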
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    # score = log prior + sum of the log word probabilities for the words that are present
    p1 = sum(vec2Classify*p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify*p0Vec) + math.log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0
# classifyNB compares the two class scores and returns the label of the larger one
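# Aside (my own illustration): classifyNB adds log probabilities instead of
# multiplying raw probabilities because a product of many small numbers
# underflows to 0.0 in floating point; for example math.prod([0.01] * 200)
# evaluates to 0.0, while the corresponding sum of logs is about -921 and is
# still usable for comparing the two classes.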
def testingNB(ceshi):
    postingList, classVec = loadDataSet()
    new_list = createVocabList(postingList)
    train = []
    for i in postingList:
        train.append(setOfWords2Vec(new_list, i))
    p0Vect, p1Vect, pAbusive = trainNB0(train, classVec)
    ce = setOfWords2Vec(new_list, ceshi)
    print(classifyNB(ce, p0Vect, p1Vect, pAbusive))
# testingNB ties everything together: build the vocabulary, train on the sample posts, then classify the given post

Test results:
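A minimal way to exercise the whole pipeline (the two test posts below are my own; with the smoothed, log-space estimates above, the first should print 0 and the second should print 1):

testingNB(['love', 'my', 'dalmation'])   # words typical of the non-abusive posts
testingNB(['stupid', 'garbage'])         # words typical of the abusive posts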
