Generating a Word Cloud with a Python Crawler
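Generating a word cloud on its own is fairly simple, and there are plenty of tutorials online. As a crawler beginner I'm just sharing my modest attempt here, so please bear with me.
1. First, the libraries we need are jieba, matplotlib, and wordcloud.
jieba is a word segmentation library implemented in Python, with strong support for Chinese text.
(For details, see https://www.cnblogs.com/jiayongji/p/7119065.html)
Matplotlib is one of the most commonly used visualization tools in Python; it makes it easy to create many types of 2D charts and some basic 3D charts.
(For details, see https://www.cnblogs.com/TensorSense/p/6802280.html)
wordcloud is a Python library for generating word clouds.
(For details, see https://blog.****.net/heyuexianzi/article/details/76851377)
2. The code (adapted from https://www.cnblogs.com/franklv/p/6995150.html)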
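import jieba.analyse
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Read the source text (assumes name.txt is UTF-8 encoded)
with open('name.txt', encoding='utf-8') as f:
    text = f.read()
# Extract the top 100 keywords together with their TextRank weights
result = jieba.analyse.textrank(text, topK=100, withWeight=True)
# print(result)
keywords = dict()
for i in result:
    keywords[i[0]] = i[1]
# print(keywords)
color_mask = plt.imread("a.jpg")  # background image that defines the shape
cloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simfang.ttf",  # a font that contains Chinese glyphs
    background_color='white',
    mask=color_mask,
    max_words=1000,
    stopwords=STOPWORDS,
    random_state=30,  # random seed, so the layout and colors are reproducible
    scale=.5
    # max_font_size=40
)
word_cloud = cloud.generate_from_frequencies(keywords)
word_cloud.to_file("a2.png")
plt.imshow(word_cloud)
plt.axis('off')
plt.show()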
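generate_from_frequencies expects a dict that maps each word to a weight. The sketch below shows two ways to build such a dict using only the standard library: directly from (word, weight) pairs like the ones textrank returns with withWeight=True, and by counting tokens that have already been segmented. The helper names and the sample data are made up for illustration and are not part of the original script.

```python
from collections import Counter


def weights_from_pairs(pairs):
    """Turn [(word, weight), ...] into {word: weight}."""
    return dict(pairs)


def weights_from_tokens(tokens, min_length=2):
    """Count tokens, skipping ones shorter than min_length characters."""
    counts = Counter(t for t in tokens if len(t) >= min_length)
    return dict(counts)


# Hypothetical sample data standing in for real jieba output
sample_pairs = [("词云", 1.0), ("爬虫", 0.8), ("分词", 0.5)]
sample_tokens = ["词云", "爬虫", "的", "词云", "分词", "词云"]

print(weights_from_pairs(sample_pairs))
print(weights_from_tokens(sample_tokens))
```

Either dict can be passed to cloud.generate_from_frequencies in place of keywords. Plain counts are a simpler alternative to TextRank weights, not a replacement for them.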
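3. Notes
Prepare a background image first. It should look something like this one, ideally with a blank (white) background, so that the word cloud takes on the outline of the shape. Photoshop experts have nothing to worry about, since they can simply swap out the background; everyone else will have to hunt around for a suitable image. The image on the right is the generated word cloud.
4. One more thing: the words in this word cloud come from my best friend's Qzone posts (the code is borrowed from someone else).
The code: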
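As far as I understand the wordcloud mask behavior, only pure white pixels (value 255) count as "masked out", so a JPEG background that is merely near-white may not produce a clean outline. The sketch below shows the idea of pushing near-white pixels to pure white. It works on plain nested lists of grayscale values to stay dependency-free; the function name, the threshold of 240, and the tiny sample image are all assumptions for illustration. In practice the array returned by plt.imread would be processed with NumPy instead.

```python
def whiten_background(gray_rows, threshold=240):
    """Return a copy in which every pixel >= threshold becomes pure white (255)."""
    return [[255 if p >= threshold else p for p in row] for row in gray_rows]


# Hypothetical 3x4 grayscale image: a dark shape on a near-white background
image = [
    [250, 248, 251, 249],
    [247, 30, 40, 252],
    [251, 245, 250, 246],
]

mask = whiten_background(image)
for row in mask:
    print(row)
```

After this step only the two dark pixels in the middle row remain as drawable area; everything else is exactly 255.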
# -*- coding:utf-8 -*-
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import etree

driver = webdriver.Firefox()
driver.get("http://i.qq.com")
driver.maximize_window()
# Not used by Selenium below; kept from the original script
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'
headers = {'User-Agent': user_agent}
# The login form lives inside an iframe
driver.switch_to.frame("login_frame")
driver.find_element(By.ID, "switcher_plogin").click()
time.sleep(2)
driver.find_element(By.ID, "u").send_keys('your_account')
driver.find_element(By.ID, "p").send_keys("your_password")
driver.find_element(By.ID, "login_button").click()
time.sleep(2)
driver.switch_to.default_content()
driver.get("http://user.qzone.qq.com/" + "friend_qq_number" + "/311")
next_num = 0
while True:
    # Scroll down several times so that the whole page gets loaded
    for i in range(1, 6):
        height = 20000 * i
        strWord = "window.scrollBy(0," + str(height) + ")"
        driver.execute_script(strWord)
        time.sleep(4)
    # The posts are rendered inside another iframe
    driver.switch_to.frame("app_canvas_frame")
    selector = etree.HTML(driver.page_source)
    divs = selector.xpath('//*[@id="msgList"]/li/div[3]')
    with open('qq_word.txt', 'a', encoding='utf-8') as f:
        for div in divs:
            qq_name = div.xpath('./div[2]/a/text()')
            qq_content = div.xpath('./div[2]/pre/text()')
            qq_time = div.xpath('./div[4]/div[1]/span/a/text()')
            qq_name = qq_name[0] if len(qq_name) > 0 else ''
            qq_content = qq_content[0] if len(qq_content) > 0 else ''
            qq_time = qq_time[0] if len(qq_time) > 0 else ''
            print(qq_name, qq_time, qq_content)
            f.write(qq_content + "\n")
    # Stop when there is no "next page" link left
    if driver.page_source.find('pager_next_' + str(next_num)) == -1:
        break
    driver.find_element(By.ID, 'pager_next_' + str(next_num)).click()
    next_num += 1
    driver.switch_to.parent_frame()
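Note: switching frames can be a bit tricky. You can try any of the following usages:
driver.switch_to.frame(0)  # 1. locate by frame index; the first one is 0
driver.switch_to.frame("frame1")  # 2. locate by id
driver.switch_to.frame("myframe")  # 3. locate by name
driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))  # 4. locate by WebElement (needs: from selenium.webdriver.common.by import By)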
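One detail of the crawler's output is worth handling before segmentation: it writes a line even when a post has no text, and because qq_word.txt is opened in append mode, running the script twice stores every post twice. A minimal cleanup sketch, assuming that exact duplicates should be dropped (the helper name and the sample lines are hypothetical):

```python
def clean_lines(lines):
    """Strip whitespace, drop empty lines, and drop exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical lines, shaped like what the crawler appends to qq_word.txt
sample = ["今天天气真好\n", "\n", "今天天气真好\n", "  出去玩  \n"]
print(clean_lines(sample))
```

To use it on the real file, pass the open file object, e.g. clean_lines(open('qq_word.txt', encoding='utf-8')), and join the result with newlines before handing the text to jieba.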