(Revised) Machine Learning Text Classification (with Training Set, Dataset, and All Code)

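This post is an improved version of my earlier one: I cleaned up the file handling and added more comments so that people working on NLP and text classification can get the code running faster.

Original post: https://blog.****.net/qq_28626909/article/details/80382029

TF-IDF, naive Bayes, word segmentation, and stop words are covered in detail in the earlier post (linked above), so I will not repeat them here. This post is about getting the code to run. I wrote the earlier code when I was still a beginner, so many parts were hard to follow. Below are the main questions readers asked at the time and the fixes made in this post.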

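1. Problem: the .dat files cannot be viewed. Solution: the code now also generates detailed .txt files that you can open directly.

2. Problem: it is unclear what the generated files contain. Solution: same as above; the detailed .txt files can be opened directly.

3. Problem: file paths had to be edited (the earlier code had no comments). Solution: all absolute paths are replaced with relative paths and commented, so the code runs as downloaded.

4. Problem: some readers hit environment issues. Solution: the end of this post lists the most common problems and their fixes.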
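
The .dat files are ordinary pickle files, so a few lines of Python can show what is inside one without opening it in an editor. The following is a minimal sketch, assuming the file was written with pickle.dump as the script in this post does. `describe_pickle` is a hypothetical helper, not part of the original code, and a plain dict stands in for the Bunch so the example runs without scikit-learn.

```python
import os
import pickle
import tempfile


def describe_pickle(path):
    """Load a pickle file and summarize each top-level field (hypothetical helper)."""
    with open(path, "rb") as f:  # pickle files must be opened in binary mode
        obj = pickle.load(f)     # only unpickle files you created or trust
    summary = {}
    for key, value in dict(obj).items():
        try:
            size = len(value)
        except TypeError:
            # e.g. sparse matrices have no len(); fall back to their shape
            size = getattr(value, "shape", None)
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in data with the same field names the script stores in train_set.dat.
sample = {
    "target_name": ["sports", "finance"],
    "label": ["sports", "finance", "sports"],
    "filenames": ["a.txt", "b.txt", "c.txt"],
    "contents": ["word1 word2", "word3", "word4 word5"],
}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_set.dat")
with open(demo_path, "wb") as f:
    pickle.dump(sample, f)

info = describe_pickle(demo_path)
for field, (kind, size) in info.items():
    print(field, kind, size)
```

To inspect a real file, call `describe_pickle("./train_set.dat")` in an environment where scikit-learn is installed, because unpickling a Bunch needs its class to be importable.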

Files (the folder is named ****; the original post showed a screenshot of its contents here):

Most readers use PyCharm as their IDE, so I will demonstrate running the code in PyCharm.

Move the folder into your PyCharm project.

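Every path in this Python file is relative, so you should be able to run it without changing any path, as long as the script and the data folders stay together.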
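
One caveat: relative paths such as "./data/" are resolved against the current working directory, not against the script's own location, so running the script from a different directory will not find the data. The sketch below shows one way to anchor the paths explicitly. `resolve_from` is a hypothetical helper, not part of the original code.

```python
from pathlib import Path


def resolve_from(base_dir, relative):
    """Join a relative path onto base_dir and return an absolute path (hypothetical helper)."""
    return (Path(base_dir) / relative).resolve()


# In a script, base_dir would usually be Path(__file__).resolve().parent;
# Path.cwd() is used here so the snippet also runs in an interactive session.
base_dir = Path.cwd()
datapath = resolve_from(base_dir, "./data/")
stopword_path = resolve_from(base_dir, "./stop/stopword.txt")

print(datapath)
print(stopword_path)
print("data folder found:", datapath.is_dir())
```

The functions in this post build paths by string concatenation and expect a trailing slash, so a resolved folder would have to be passed as `str(datapath) + "/"`.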

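    datapath = "./data/"  # raw training data
    stopWord_path = "./stop/stopword.txt"  # stop-word list
    test_path = "./test/"  # raw test data
    '''
    The three paths above must already exist; every path below is created when the code runs.
    The .dat files exist so the program can reload data easily, and the .txt files exist so you
    can read the segmentation, matrix, and vocabulary details.
    The .dat files cannot be opened in a normal text editor.
    '''
    test_split_dat_path = "./test_set.dat"  # pickled Bunch of the segmented test set
    testspace_dat_path = "./testspace.dat"  # pickled TF-IDF space of the test set
    train_dat_path = "./train_set.dat"  # pickled Bunch of the segmented training set
    tfidfspace_dat_path = "./tfidfspace.dat"  # pickled TF-IDF space of the training set
    '''
    The four .dat paths above only store intermediate data; do not try to open them.
    '''
    test_split_path = './split/test_split/'  # segmented test-set files
    split_datapath = "./split/split_data/"  # segmented training-set files
    '''
    The two folders above hold the segmented text; open them after a run to study the results.
    '''
    tfidfspace_path = "./tfidfspace.txt"  # training Bunch metadata (labels, file names) as text
    tfidfspace_arr_path = "./tfidfspace_arr.txt"  # training TF-IDF matrix as text
    tfidfspace_vocabulary_path = "./tfidfspace_vocabulary.txt"  # training vocabulary (term -> column index)
    testSpace_path = "./testSpace.txt"  # test Bunch metadata (labels, file names) as text
    testSpace_arr_path = "./testSpace_arr.txt"  # test TF-IDF matrix as text
    trainbunch_vocabulary_path = "./trainbunch_vocabulary.txt"  # training vocabulary reused for the test set
    tfidfspace_out_arr_path = "./tfidfspace_out_arr.txt"  # training TF-IDF matrix, written by bayesAlgorithm
    tfidfspace_out_word_path = "./tfidfspace_out_word.txt"  # whole training Bunch as text
    testspace_out_arr_path = "./testspace_out_arr.txt"  # whole test Bunch as text
    testspace_out_word_apth = "./testspace_out_word.txt"  # test-set labels as text
    '''
    The ten .txt files above are text dumps of the .dat contents for you to inspect;
    they are useful material for learning NLP (natural language processing).
    '''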

The snippet above documents each file, and the comments should be fairly detailed. The complete code follows:

#!D:/workplace/python
# -*- coding: utf-8 -*-
# @File  : TFIDF_naive_bayes_wy.py
# @Author: WangYe
# @Date  : 2019/5/29
# @Software: PyCharm
# Machine learning text classification (with training set, dataset, and all code)
# Blog link: https://blog.****.net/qq_28626909/article/details/80382029
import jieba
from numpy import *
import pickle  # object persistence
import os
from sklearn.feature_extraction.text import TfidfTransformer  # converts a count matrix to TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer  # builds TF-IDF vectors from raw text
from sklearn.utils import Bunch  # sklearn.datasets.base was removed in scikit-learn 0.24
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes


def readFile(path):
    # Some documents contain badly encoded bytes, so errors='ignore' skips them
    with open(path, 'r', errors='ignore') as file:
        content = file.read()
    return content  # the with block has already closed the file


def saveFile(path, result):
    with open(path, 'w', errors='ignore') as file:
        file.write(result)


def segText(inputPath, resultPath):
    fatherLists = os.listdir(inputPath)  # top-level directory, one sub-folder per category
    for eachDir in fatherLists:  # loop over the category folders
        eachPath = inputPath + eachDir + "/"  # path of this category folder
        each_resultPath = resultPath + eachDir + "/"  # where this category's segmented files go
        if not os.path.exists(each_resultPath):
            os.makedirs(each_resultPath)
        childLists = os.listdir(eachPath)  # files inside this category folder
        for eachFile in childLists:  # loop over the files
            eachPathFile = eachPath + eachFile  # full path of this file
            #  print(eachFile)
            content = readFile(eachPathFile)  # read the raw text
            # content = str(content)
            # Remove Windows line breaks and trim surrounding whitespace. In text mode Python
            # normally converts "\r\n" to "\n" on read, so the replace rarely matches.
            result = (str(content)).replace("\r\n", "").strip()
            # result = content.replace("\r\n","").strip()

            cutResult = jieba.cut(result)  # jieba's default (accurate) mode; yields tokens
            saveFile(each_resultPath + eachFile, " ".join(cutResult))  # join tokens with spaces and save


def bunchSave(inputFile, outputFile):
    catelist = os.listdir(inputFile)
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)  # store the category names in the Bunch
    for eachDir in catelist:
        eachPath = inputFile + eachDir + "/"
        fileList = os.listdir(eachPath)
        for eachFile in fileList:  # every file in the category folder
            fullName = eachPath + eachFile  # full path of the file
            bunch.label.append(eachDir)  # category label of this file
            bunch.filenames.append(fullName)  # path of this file
            bunch.contents.append(readFile(fullName).strip())  # space-separated segmented text
    with open(outputFile, 'wb') as file_obj:  # pickle requires binary mode
        pickle.dump(bunch, file_obj)
        # pickle.dump(obj, file, protocol=None) serializes obj into the open file.
        # obj: the object to serialize.
        # file: a file object opened for binary writing (not a file name).
        # protocol: pickle protocol version. If omitted, pickle.DEFAULT_PROTOCOL is used;
        # a negative value or pickle.HIGHEST_PROTOCOL selects the highest available version.


def readBunch(path):
    with open(path, 'rb') as file:
        bunch = pickle.load(file)
        # pickle.load(file) deserializes the object stored in the open binary file.
    return bunch


def writeBunch(path, bunchFile):
    with open(path, 'wb') as file:
        pickle.dump(bunchFile, file)


def getStopWord(inputFile):
    stopWordList = readFile(inputFile).splitlines()
    return stopWordList


def getTFIDFMat(inputPath, stopWordList, outputPath,
                tfidfspace_path, tfidfspace_arr_path, tfidfspace_vocabulary_path):  # build the training TF-IDF matrix
    bunch = readBunch(inputPath)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[],
                       vocabulary={})
    # Dump the still-empty TF-IDF space (labels and file names) as text
    tfidfspace_out = str(tfidfspace)
    saveFile(tfidfspace_path, tfidfspace_out)
    # Initialize the vectorizer
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5)
    # Convert the text to a TF-IDF matrix and keep the vocabulary separately
    tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    tfidfspace_arr = str(tfidfspace.tdm)  # reuse the matrix instead of fitting a second time
    saveFile(tfidfspace_arr_path, tfidfspace_arr)
    tfidfspace.vocabulary = vectorizer.vocabulary_  # term -> column index mapping
    tfidfspace_vocabulary = str(vectorizer.vocabulary_)
    saveFile(tfidfspace_vocabulary_path, tfidfspace_vocabulary)
    # Persist the TF-IDF space
    writeBunch(outputPath, tfidfspace)


def getTestSpace(testSetPath, trainSpacePath, stopWordList, testSpacePath,
                 testSpace_path,testSpace_arr_path,trainbunch_vocabulary_path):
    bunch = readBunch(testSetPath)
    # Build the TF-IDF space for the test set
    testSpace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[],
                      vocabulary={})
    # Dump the still-empty test space (labels and file names) as text
    testSpace_out = str(testSpace)
    saveFile(testSpace_path, testSpace_out)
    # Load the training-set vocabulary
    trainbunch = readBunch(trainSpacePath)
    # Initialize the vectorizer with the training vocabulary so the columns match
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5,
                                 vocabulary=trainbunch.vocabulary)
    # Note: fit_transform recomputes the IDF weights from the test documents; only the
    # vocabulary (column layout) is shared with the training set.
    testSpace.tdm = vectorizer.fit_transform(bunch.contents)
    testSpace.vocabulary = trainbunch.vocabulary
    testSpace_arr = str(testSpace.tdm)
    trainbunch_vocabulary = str(trainbunch.vocabulary)
    saveFile(testSpace_arr_path, testSpace_arr)
    saveFile(trainbunch_vocabulary_path, trainbunch_vocabulary)
    # Persist the test space
    writeBunch(testSpacePath, testSpace)


def bayesAlgorithm(trainPath, testPath,tfidfspace_out_arr_path,
                   tfidfspace_out_word_path,testspace_out_arr_path,
                   testspace_out_word_apth):
    trainSet = readBunch(trainPath)
    testSet = readBunch(testPath)
    clf = MultinomialNB(alpha=0.001).fit(trainSet.tdm, trainSet.label)
    # alpha is the additive (Laplace/Lidstone) smoothing parameter; 0.001 applies very light
    # smoothing. Naive Bayes training is not iterative.
    # print(shape(trainSet.tdm))  # shape of the training matrix
    # print(shape(testSet.tdm))
    # Dump the .dat contents as text
    tfidfspace_out_arr = str(trainSet.tdm)
    tfidfspace_out_word = str(trainSet)
    saveFile(tfidfspace_out_arr_path, tfidfspace_out_arr)  # training matrix as text
    saveFile(tfidfspace_out_word_path, tfidfspace_out_word)  # whole training Bunch as text

    testspace_out_arr = str(testSet)  # whole test Bunch as text
    testspace_out_word = str(testSet.label)  # test-set labels as text
    saveFile(testspace_out_arr_path, testspace_out_arr)
    saveFile(testspace_out_word_apth, testspace_out_word)

    # Predict and report every misclassified file
    predicted = clf.predict(testSet.tdm)
    total = len(predicted)
    rate = 0
    for flabel, fileName, expct_cate in zip(testSet.label, testSet.filenames, predicted):
        if flabel != expct_cate:
            rate += 1
            print(fileName, ": actual class:", flabel, "--> predicted class:", expct_cate)
    print("error rate:", float(rate) * 100 / float(total), "%")



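# segText: the first argument is the input folder, the second is where the segmented results are saved
if __name__ == '__main__':
    datapath = "./data/"  # raw training data
    stopWord_path = "./stop/stopword.txt"  # stop-word list
    test_path = "./test/"  # raw test data
    '''
    The three paths above must already exist; every path below is created when the code runs.
    The .dat files exist so the program can reload data easily, and the .txt files exist so you
    can read the segmentation, matrix, and vocabulary details.
    The .dat files cannot be opened in a normal text editor.
    '''
    test_split_dat_path = "./test_set.dat"  # pickled Bunch of the segmented test set
    testspace_dat_path = "./testspace.dat"  # pickled TF-IDF space of the test set
    train_dat_path = "./train_set.dat"  # pickled Bunch of the segmented training set
    tfidfspace_dat_path = "./tfidfspace.dat"  # pickled TF-IDF space of the training set
    '''
    The four .dat paths above only store intermediate data; do not try to open them.
    '''
    test_split_path = './split/test_split/'  # segmented test-set files
    split_datapath = "./split/split_data/"  # segmented training-set files
    '''
    The two folders above hold the segmented text; open them after a run to study the results.
    '''
    tfidfspace_path = "./tfidfspace.txt"  # training Bunch metadata (labels, file names) as text
    tfidfspace_arr_path = "./tfidfspace_arr.txt"  # training TF-IDF matrix as text
    tfidfspace_vocabulary_path = "./tfidfspace_vocabulary.txt"  # training vocabulary (term -> column index)
    testSpace_path = "./testSpace.txt"  # test Bunch metadata (labels, file names) as text
    testSpace_arr_path = "./testSpace_arr.txt"  # test TF-IDF matrix as text
    trainbunch_vocabulary_path = "./trainbunch_vocabulary.txt"  # training vocabulary reused for the test set
    tfidfspace_out_arr_path = "./tfidfspace_out_arr.txt"  # training TF-IDF matrix, written by bayesAlgorithm
    tfidfspace_out_word_path = "./tfidfspace_out_word.txt"  # whole training Bunch as text
    testspace_out_arr_path = "./testspace_out_arr.txt"  # whole test Bunch as text
    testspace_out_word_apth = "./testspace_out_word.txt"  # test-set labels as text
    '''
    The ten .txt files above are text dumps of the .dat contents for you to inspect;
    they are useful material for learning NLP (natural language processing).
    '''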

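    # Process the training set
    segText(datapath,  # input: raw data
            split_datapath)  # output: segmented text
    bunchSave(split_datapath,  # input: segmented text
              train_dat_path)  # output: pickled Bunch of the segmented text
    stopWordList = getStopWord(stopWord_path)  # load the stop-word list
    getTFIDFMat(train_dat_path,  # input: pickled Bunch of the segmented text
                stopWordList,  # stop-word list
                tfidfspace_dat_path,  # output: pickled TF-IDF space
                tfidfspace_path,  # output: Bunch metadata as .txt
                tfidfspace_arr_path,  # output: TF-IDF matrix as .txt
                tfidfspace_vocabulary_path)  # output: vocabulary as .txt
    '''
    The test-set calls below take essentially the same arguments as the training-set calls above.
    '''
    # Process the test set
    segText(test_path,
            test_split_path)  # read the test files and write the segmented text
    bunchSave(test_split_path,
              test_split_dat_path)  # pickle the segmented test set
    getTestSpace(test_split_dat_path,
                 tfidfspace_dat_path,
                 stopWordList,
                 testspace_dat_path,
                 testSpace_path,
                 testSpace_arr_path,
                 trainbunch_vocabulary_path)  # inputs: segmented test set, training space, stop words; outputs: test space (.dat and .txt)
    bayesAlgorithm(tfidfspace_dat_path,
                   testspace_dat_path,
                   tfidfspace_out_arr_path,
                   tfidfspace_out_word_path,
                   testspace_out_arr_path,
                   testspace_out_word_apth)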
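
To make the role of alpha in MultinomialNB concrete, here is a small pure-Python sketch of multinomial naive Bayes with additive smoothing, followed by the same error-rate calculation that bayesAlgorithm prints. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw word counts instead of TF-IDF weights, and `train_nb`, `predict_nb`, and the toy documents are all made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial naive Bayes model on tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {"log_prior": {}, "log_like": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        # alpha is added to every count so an unseen word never gets probability 0
        model["log_like"][label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for label, prior in model["log_prior"].items():
        like = model["log_like"][label]
        scores[label] = prior + sum(like[w] for w in doc if w in like)
    return max(scores, key=scores.get)


train_docs = [["goal", "match", "team"], ["team", "win"],
              ["stock", "market"], ["market", "bank", "stock"]]
train_labels = ["sports", "sports", "finance", "finance"]
test_docs = [["team", "goal"], ["bank", "stock"], ["market", "team", "stock"]]
test_labels = ["sports", "finance", "finance"]

model = train_nb(train_docs, train_labels, alpha=0.001)
predicted = [predict_nb(model, doc) for doc in test_docs]
errors = sum(1 for truth, guess in zip(test_labels, predicted) if truth != guess)
error_rate = errors * 100 / len(test_docs)
print(predicted, "error rate:", error_rate, "%")
```

A smaller alpha trusts the observed counts more and penalizes unseen words more heavily; it does not change how many passes the training makes, since the model is fitted in a single counting pass.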

Then we run the script:

The console output is the same as before, but many files are now generated:

The split folder contains the segmented files for the training set and the test set.

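The remaining .dat files cannot be opened directly, but I have converted each one to a matching .txt file. Every file is described in the comments above, so you can look up whichever one you need. They are good material for learning NLP. The original post showed screenshots of two of them as examples.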

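The first screenshot shows the term matrix with the TF-IDF values already computed; the second shows word frequencies. See the original post linked at the top for details.

(If the files look garbled when opened, switch the encoding to GBK. Notepad converts automatically, so nothing needs to be done there; in PyCharm you have to select the encoding manually, as a screenshot in the original post showed.)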
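
To show how the numbers in such a matrix come about, here is a pure-Python sketch of TF-IDF in which the vocabulary and IDF weights are learned from the training documents only and then reused for a test document. It follows the formulas documented for scikit-learn's defaults with sublinear_tf=True (tf = 1 + ln(count), idf = ln((1 + n) / (1 + df)) + 1, L2 normalization), but it is an illustration and not the library's code. `fit_idf`, `transform`, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter(w for doc in train_docs for w in set(doc))
    vocab = {w: i for i, w in enumerate(sorted(doc_freq))}
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return vocab, idf


def transform(doc, vocab, idf):
    """Turn one tokenized document into an L2-normalized TF-IDF row (dense list)."""
    counts = Counter(w for w in doc if w in vocab)  # words outside the vocabulary are dropped
    row = [0.0] * len(vocab)
    for w, tf in counts.items():
        row[vocab[w]] = (1 + math.log(tf)) * idf[w]  # sublinear term frequency
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


train = [["team", "goal", "team"], ["stock", "market"], ["team", "market"]]
vocab, idf = fit_idf(train)  # statistics come from the training data only
test_row = transform(["team", "bank", "goal"], vocab, idf)  # and are reused for test data

print(vocab)
print([round(v, 3) for v in test_row])
```

In scikit-learn terms this corresponds to calling fit_transform on the training texts and transform on the test texts with the same fitted TfidfVectorizer. The script in this post instead calls fit_transform on the test set with the training vocabulary, which keeps the columns aligned but recomputes the IDF weights from the test documents.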
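
If you would rather read the generated files from Python than from an editor, the encoding can be handled in code. The sketch below tries UTF-8 first and falls back to GBK; `read_text` is a hypothetical helper, and the assumption that your files are in one of these two encodings depends on your system's default encoding.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order and return (text, encoding_used) (hypothetical helper)."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode " + path)


# Write a small GBK-encoded file to demonstrate the fallback.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_gbk.txt")
with open(demo_path, "w", encoding="gbk") as f:
    f.write("文本分类")

text, used = read_text(demo_path)
print(used, text)
```

UTF-8 is tried first because it rejects most byte sequences that are not valid UTF-8, whereas GBK accepts many arbitrary byte sequences and would silently produce wrong characters.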

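Finally, since many readers may be beginners or new to the field, here are some common problems. When I started learning, I also had someone more experienced helping me.

Below are the bug screenshots that readers sent me on WeChat: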

This error means a package is missing. Install it from a terminal with:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package

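Replace some-package with the missing module. For the code in this post, that means jieba, numpy, and scikit-learn respectively.

Someone will surely ask where the terminal is. There are two ways to open one:

1. On Windows, press Win + R, type cmd, and paste the command above (the current directory does not matter).

On Linux, just type the command into a terminal.

2. In PyCharm, click the Terminal tab (the original post pointed to it in a screenshot).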
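
Before installing anything, you can check which of the three packages are actually missing from the interpreter you are running. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper. Note that the pip name and the import name can differ: scikit-learn is imported as sklearn.

```python
import importlib.util

# pip package name -> name used in the import statement (they can differ)
REQUIRED = {"jieba": "jieba", "numpy": "numpy", "scikit-learn": "sklearn"}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("pip install -i https://pypi.tuna.tsinghua.edu.cn/simple " + " ".join(missing))
else:
    print("all required packages are installed")
```

Run it with the same interpreter that PyCharm uses for the project; a package installed for a different Python installation will still be reported as missing.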

Then type the command and press Enter.

There are many other ways, of course; these two are the most beginner-friendly.

Some readers also see PyCharm report that no environment (interpreter) is configured, and wonder how that can be when Python or Anaconda is already installed.

Here is a link to another blog post that you can refer to:

https://blog.****.net/weixin_41923961/article/details/86584683

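Download link for the study files and code (input files only; the output files are generated when you run the code; recommended for learning):

Link: https://pan.baidu.com/s/1IW6kMev17sjyPFdizsS13g
Extraction code: ap7m

One last thing. Some people need to hand this in as homework and are in a hurry, saying they will fail the course if it is not submitted by tomorrow. So here is another link with the data files I already generated, which you can submit as they are. I do not recommend it, though; after all, I went to the trouble of writing this post to teach you how to run the code.

Download link for the generated files, code, and run screenshots (no watermark) for those whose homework is due tomorrow (strongly discouraged; you will not learn anything from it):

Link: https://pan.baidu.com/s/1arv3b-poyMUFxz3dcaSm5g
Extraction code: ofa2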

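Because so many people ask questions in the comments, here is my personal WeChat: wy1119744330. Please mention "**** blog" in the friend request.

I will do my best to answer your questions. Thank you all.

Finally, here is the original post once more: https://blog.****.net/qq_28626909/article/details/80382029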
