Segmenting The Three-Body Problem and generating word vectors
The TXT version of The Three-Body Problem was downloaded from the internet.
The txt file is at "f:\test5\threebody.txt"; the segmented output is written to "f:\test5\threebody2.txt".
Segmentation is done with the jieba tool.
import jieba

filePath = r'f:\test5\threebody.txt'
fileSegWordDonePath = r'f:\test5\threebody2.txt'

# Read the novel line by line, stripping the trailing newline
# so it does not end up as a "word" after segmentation
fileTrainRead = []
with open(filePath, encoding='utf-8') as f:
    for line in f:
        fileTrainRead.append(line.strip())

# Segment each line in accurate mode and join the tokens with spaces
fileTrainSeg = []
for i in range(len(fileTrainRead)):
    fileTrainSeg.append(' '.join(jieba.cut(fileTrainRead[i], cut_all=False)))
    if i % 100 == 0:
        print(i)  # progress indicator

# Write the segmented text, one line per original line
with open(fileSegWordDonePath, 'w', encoding='utf-8') as f:
    for seg in fileTrainSeg:
        f.write(seg)
        f.write('\n')
The segmented result looks like this: (screenshot of threebody2.txt, not reproduced here)
Both files are opened with utf-8 encoding, because the downloaded txt file is itself utf-8 encoded.
Training the word vectors and saving the model
from gensim.models import word2vec

# Text8Corpus treats the file as whitespace-separated tokens,
# which matches the space-joined output of the segmentation step
sentence = word2vec.Text8Corpus(r'f:\test5\threebody2.txt')
model = word2vec.Word2Vec(sentence)
model.save(r'f:\test5\threebody2.bin')
Loading the trained word vectors whenever needed
from gensim.models import word2vec

model = word2vec.Word2Vec.load(r'f:\test5\threebody2.bin')
# In gensim 4.x the query methods live on model.wv rather than the model itself
a = model.wv.most_similar('三体')
for i in a:
    print(i)
('地球', 0.9131160974502563)
('人类', 0.8874142169952393)
('文明', 0.8738576769828796)
('宇宙', 0.8627338409423828)
('两个', 0.8485667705535889)
('太阳系', 0.8431316018104553)
('生存', 0.840785026550293)
('技术', 0.8377254009246826)
('毁灭', 0.8264228105545044)
('舰队', 0.8234724998474121)