A Basic Multithreaded Weibo Crawler with Python + Selenium
I. A Casual Overview
Hi everyone. I've been following **** ever since I started university and have learned a lot here, but I've never actually posted a blog (mostly because I'm still not very good and don't know much). Right now we're doing our end-of-term practical training in the lab, and our group is building a product recommendation system based on Weibo data. Whether the system is truly feasible or valuable isn't the point; our main goal is to learn more technology by building it and to pick up a new language (yes, forgive me, I'm only now starting to learn Python).
OK, enough small talk. This post is mainly meant to record my progress on the project, a sort of work log to look back on. I'm already a junior but still a rookie. The part that crawls Weibo content borrows code from this post: https://blog.****.net/d1240673769/article/details/74278547. Comments and suggestions are very welcome.
This article mainly covers how I use Selenium to drive a browser that searches for a Weibo user by nickname, opens the user's Weibo homepage, and saves the posts locally; the user's avatar is saved along the way.
II. Environment Setup
1. My environment is Python 3.6 and the IDE is PyCharm. PyCharm can install the selenium package (which provides webdriver) and the other packages we need directly.
To add packages after creating the project, go to File -> Settings -> Project: "project name" -> Project Interpreter, as shown below:
Then double-click pip on the right to open the list of all packages, search for the one you need, and click Install Package:
You can queue up several packages at once; when you're done, simply close the window and click OK, and the installation runs in the background.
A notification appears at the bottom of PyCharm when the installation finishes.
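If you'd rather install from the command line, the same packages can be installed with pip (the package names below are the standard PyPI names for the imports used later in this post):

pip install selenium beautifulsoup4 pyquery lxml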
2. Download chromedriver from http://npm.taobao.org/mirrors/chromedriver/; check notes.txt there to find the chromedriver version that matches your installed Chrome.
After downloading, unzip it and copy the executable straight into the project directory; for example, mine goes here:
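A quick way to check that the driver is picked up is a minimal sketch like the following (it assumes chromedriver.exe was unzipped into the project root and that the project root is the working directory when the script runs):

from selenium import webdriver

# selenium 3 style: point Chrome at the local chromedriver binary
driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get('https://www.weibo.com/')
print(driver.title)
driver.quit()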
3. Now for the code. I crawl Weibo through the mobile site: by building the URL https://m.weibo.cn/u/<user OID> you can read all of a user's posts directly, with no login required. Going through the WAP site instead is harder, because each user can change their homepage address, so there is no reliable pattern to exploit, and my skills are limited.
The overall flow of the crawler is:
Drive the browser to https://weibo.com -> search for the user's nickname in the search box -> switch to the "Find People" results page -> grab the user's Weibo homepage URL and open it -> extract the user's oid -> open https://m.weibo.cn/u/'oid' -> match and extract the content with regular expressions.
First we build the OidSpider class and import the related packages:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import re
from pyquery import PyQuery as pq
Create the driver, open https://www.weibo.com, and locate the search box to type in the user's nickname; the CSS selector is copied from the page as shown in the screenshot below.
The code:
self.driver = webdriver.Chrome()
self.wait = WebDriverWait(self.driver, 10)
self.driver.get("https://www.weibo.com/")
input = self.wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > input")))
submit = self.wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
input.send_keys(self.nickName)
Then locate the search button the same way and click it, and use another locator to switch to the "Find People" tab:
submit = self.wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
submit.click()
submit = self.wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#pl_common_searchTop > div.search_topic > div > ul > li:nth-child(2) > a')))
submit.click()
Next, extract the user's Weibo homepage URL with a regular expression:
html = self.driver.page_source
doc = pq(html)
return (re.findall(r'a target="_blank"[\s\S]href="(.*)"[\s\S]title=', str(doc))[0])
Open the homepage URL and match the user's oid with another regular expression:
self.driver.get('HTTPS:' + url)
html = self.driver.page_source
soup = BeautifulSoup(html, 'lxml')
script = soup.head.find_all('script')
self.driver.close()
return (re.findall(r"'oid']='(.*)'", str(script))[0])
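Putting the two OidSpider steps together looks like this (it is the same call sequence the multithreaded driver uses further down; the nickname here is just a placeholder):

# Chain the two OidSpider steps: search for the profile URL, then parse the oid
oidspider = OidSpider('some_nickname')   # placeholder nickname
homepage_url = oidspider.constructURL()
oid = oidspider.searchOid(homepage_url)
print(oid)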
Next comes the WeiboSpider class; import the related packages:
from selenium import webdriver
import urllib.request
import json
from selenium.webdriver.support.ui import WebDriverWait
Build the request headers and the proxy handler:
req = urllib.request.Request(url)
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
proxy = urllib.request.ProxyHandler({'http': self.__proxyAddr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
return data
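Note that the proxy address is hard-coded and free proxies tend to die quickly. If you don't actually need one, the same request can be made without the ProxyHandler; a minimal variant (my simplification, not the original method):

def constructProxy(self, url):
    # Variant without a proxy: same User-Agent header, direct connection
    req = urllib.request.Request(url)
    req.add_header("User-Agent",
                   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    return urllib.request.urlopen(req).read().decode('utf-8', 'ignore')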
Locate the user's avatar with XPath (XPath works much like a CSS selector but is more powerful) and save it straight to disk:
self.driver.get("https://m.weibo.cn/u/" + self.oid)
src = WebDriverWait(self.driver, 10).until(
    lambda driver: self.driver.find_element_by_xpath('//*[@id="app"]/div[1]/div[2]/div[1]/div/div[2]/span/img'))
imgurl = src.get_attribute('src')
urllib.request.urlretrieve(imgurl, 'D://微博用户头像/' + nickName + '.jpg')
self.driver.get(imgurl)
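One small thing worth noting: urllib.request.urlretrieve raises an error if the destination folder does not exist, so it may be worth creating it beforehand (assuming you keep the same hard-coded path):

import os

# Make sure the avatar folder exists before calling urlretrieve
os.makedirs('D://微博用户头像', exist_ok=True)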
Then loop over the pages, crawl the posts, and write them to a txt file:
while True:
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + self.oid + '&containerid=' + self.searchContainerId(url) + '&page=' + str(i)
    try:
        data = self.constructProxy(weibo_url)
        content = json.loads(data).get('data')
        cards = content.get('cards')
        if (len(cards) > 0):
            for j in range(len(cards)):
                print("-----正在爬取第" + str(i) + "页,第" + str(j) + "条微博------")
                card_type = cards[j].get('card_type')
                if (card_type == 9):
                    mblog = cards[j].get('mblog')
                    attitudes_count = mblog.get('attitudes_count')
                    comments_count = mblog.get('comments_count')
                    created_at = mblog.get('created_at')
                    reposts_count = mblog.get('reposts_count')
                    scheme = cards[j].get('scheme')
                    text = mblog.get('text')
                    with open(nickName + '.txt', 'a', encoding='utf-8') as fh:
                        fh.write("----第" + str(i) + "页,第" + str(j) + "条微博----" + "\n")
                        fh.write("微博地址:" + str(scheme) + "\n" + "发布时间:" + str(
                            created_at) + "\n" + "微博内容:" + text + "\n" + "点赞数:" + str(
                            attitudes_count) + "\n" + "评论数:" + str(comments_count) + "\n" + "转发数:" + str(
                            reposts_count) + "\n")
            i += 1
        else:
            break
    except Exception as e:
        print(e)
        break  # stop instead of looping forever once a page fails or the data runs out
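The loop above calls searchContainerId, which hasn't been shown yet; it appears in the complete WeiboSpider listing below. It fetches the user-info xhr once and pulls out the containerid of the user's "weibo" tab, which is needed to build the content URL:

def searchContainerId(self, url):
    # Read the user-info xhr and return the containerid of the 'weibo' tab
    data = self.constructProxy(url)
    content = json.loads(data).get('data')
    for data in content.get('tabsInfo').get('tabs'):
        if (data.get('tab_type') == 'weibo'):
            containerid = data.get('containerid')
    return containerid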
And of course, don't forget to close the driver at the end:
self.driver.close()
Now for the multithreading, which is actually fairly simple. Python 3 and Python 2 differ slightly here; I recommend Python 3's threading module:
from oidspider import OidSpider
from weibospider import WeiboSpider
from threading import Thread


class MultiSpider:
    userList = None
    threadList = []

    def __init__(self, userList):
        self.userList = userList

    def weiboSpider(self, nickName):
        oidspider = OidSpider(nickName)
        url = oidspider.constructURL()
        oid = oidspider.searchOid(url)
        weibospider = WeiboSpider(oid)
        weibospider.searchWeibo(nickName)

    def mutiThreads(self):
        for niName in self.userList:
            t = Thread(target=self.weiboSpider, args=(niName,))
            self.threadList.append(t)
        for threads in self.threadList:
            threads.start()
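Be aware that with this design every thread opens its own Chrome windows (one for OidSpider and another for WeiboSpider), so crawling several users at once spawns quite a few browsers. If that becomes a nuisance, Chrome can be run headless; a minimal sketch, assuming a Chrome/chromedriver combination recent enough to support headless mode:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')              # no visible browser window
driver = webdriver.Chrome(chrome_options=options)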
The complete code follows:
#######################################################
#
#  OidSpider.py
#  Python implementation of the Class OidSpider
#  Generated by Enterprise Architect
#  Created on:      20-June-2018 10:27:14
#  Original author: McQueen
#
#######################################################
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import re
from pyquery import PyQuery as pq


class OidSpider:
    """Crawl a user's Weibo ID.

    Uses selenium to drive a browser, finds the user by searching for their
    nickname, grabs the URL of the user's Weibo homepage, opens it, and parses
    the HTML to extract the user's Weibo ID.

    nickName: the user's Weibo nickname
    driver:   the browser driver
    wait:     the explicit wait used while driving the browser
    """
    nickName = None
    driver = None
    wait = None

    def __init__(self, nickName):
        """Initialize the oid crawler with the nickname supplied by the caller."""
        self.nickName = nickName

    def constructURL(self):
        """Construct the URL.

        Drives the browser to search for the nickname and works out the URL of
        the user's Weibo homepage, which is returned.
        """
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.driver.get("https://www.weibo.com/")
        input = self.wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > input")))
        submit = self.wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
        input.send_keys(self.nickName)
        submit.click()
        submit = self.wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#pl_common_searchTop > div.search_topic > div > ul > li:nth-child(2) > a')))
        submit.click()
        html = self.driver.page_source
        doc = pq(html)
        return (re.findall(r'a target="_blank"[\s\S]href="(.*)"[\s\S]title=', str(doc))[0])

    def searchOid(self, url):
        """Crawl the user's oid.

        Parses the HTML of the user's Weibo homepage and extracts the user ID.
        url: URL of the user's Weibo homepage
        Returns the user's ID.
        """
        self.driver.get('HTTPS:' + url)
        html = self.driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        script = soup.head.find_all('script')
        self.driver.close()
        return (re.findall(r"'oid']='(.*)'", str(script))[0])
#######################################################
#
#  WeiboSpider.py
#  Python implementation of the Class WeiboSpider
#  Generated by Enterprise Architect
#  Created on:      20-June-2018 10:55:18
#  Original author: McQueen
#
#######################################################
from selenium import webdriver
import urllib.request
import json
from selenium.webdriver.support.ui import WebDriverWait


class WeiboSpider:
    """Weibo content crawler.

    Initialized with the user's oid; builds the xhr URL that the m site uses
    to load the user's profile and posts.

    oid:    the user's ID
    url:    the xhr URL the m site uses to load the user's profile
    driver: the browser driver
    """
    __proxyAddr = "122.241.72.191:808"
    oid = None
    url = None
    driver = None

    def __init__(self, oid):
        self.oid = oid
        self.url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + oid
        self.driver = webdriver.Chrome()

    def constructProxy(self, url):
        """Build the proxied request.

        Builds the request and fetches the user's xhr data, which is returned.
        """
        req = urllib.request.Request(url)
        req.add_header("User-Agent",
                       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
        proxy = urllib.request.ProxyHandler({'http': self.__proxyAddr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
        return data

    def searchContainerId(self, url):
        """Work out the containerid needed for the content xhr.

        url: the user-info xhr URL to analyse
        Returns the containerid of the user's weibo tab.
        """
        data = self.constructProxy(url)
        content = json.loads(data).get('data')
        for data in content.get('tabsInfo').get('tabs'):
            if (data.get('tab_type') == 'weibo'):
                containerid = data.get('containerid')
        return containerid

    def searchWeibo(self, nickName):
        """Crawl the user's posts and save them as a text file.

        Parses each page of the content xhr, writes the posts to a txt file,
        and uses an XPath locator to find and download the user's avatar.
        nickName: the user's Weibo nickname
        """
        i = 1
        self.driver.get("https://m.weibo.cn/u/" + self.oid)
        src = WebDriverWait(self.driver, 10).until(
            lambda driver: self.driver.find_element_by_xpath('//*[@id="app"]/div[1]/div[2]/div[1]/div/div[2]/span/img'))
        imgurl = src.get_attribute('src')
        urllib.request.urlretrieve(imgurl, 'D://微博用户头像/' + nickName + '.jpg')
        self.driver.get(imgurl)
        url = self.url
        while True:
            weibo_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + self.oid + '&containerid=' + self.searchContainerId(url) + '&page=' + str(i)
            try:
                data = self.constructProxy(weibo_url)
                content = json.loads(data).get('data')
                cards = content.get('cards')
                if (len(cards) > 0):
                    for j in range(len(cards)):
                        print("-----正在爬取第" + str(i) + "页,第" + str(j) + "条微博------")
                        card_type = cards[j].get('card_type')
                        if (card_type == 9):
                            mblog = cards[j].get('mblog')
                            attitudes_count = mblog.get('attitudes_count')
                            comments_count = mblog.get('comments_count')
                            created_at = mblog.get('created_at')
                            reposts_count = mblog.get('reposts_count')
                            scheme = cards[j].get('scheme')
                            text = mblog.get('text')
                            with open(nickName + '.txt', 'a', encoding='utf-8') as fh:
                                fh.write("----第" + str(i) + "页,第" + str(j) + "条微博----" + "\n")
                                fh.write("微博地址:" + str(scheme) + "\n" + "发布时间:" + str(
                                    created_at) + "\n" + "微博内容:" + text + "\n" + "点赞数:" + str(
                                    attitudes_count) + "\n" + "评论数:" + str(comments_count) + "\n" + "转发数:" + str(
                                    reposts_count) + "\n")
                    i += 1
                else:
                    break
            except Exception as e:
                print(e)
                break  # stop instead of looping forever once a page fails or the data runs out
        self.driver.close()
from oidspider import OidSpider
from weibospider import WeiboSpider
from threading import Thread


class MultiSpider:
    userList = None
    threadList = []

    def __init__(self, userList):
        self.userList = userList

    def weiboSpider(self, nickName):
        oidspider = OidSpider(nickName)
        url = oidspider.constructURL()
        oid = oidspider.searchOid(url)
        weibospider = WeiboSpider(oid)
        weibospider.searchWeibo(nickName)

    def mutiThreads(self):
        for niName in self.userList:
            t = Thread(target=self.weiboSpider, args=(niName,))
            self.threadList.append(t)
        for threads in self.threadList:
            threads.start()
from MultiSpider import MultiSpider


def main():
    list = ['孟美岐', '吴宣仪', '杨超越', '紫宁']
    multispider = MultiSpider(list)
    multispider.mutiThreads()


if __name__ == '__main__':
    main()
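One small caveat about mutiThreads(): it only starts the threads and then returns. The program still waits for them to finish because they are not daemon threads, but if you want main() to do something only after every crawl is done (printing a summary, for example), you can join the threads first; a possible addition at the end of mutiThreads(), not part of the original class:

for t in self.threadList:
    t.join()   # block until this crawler thread has finished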
And that's it: we can now crawl these idols' posts and avatars. Below is the content that was crawled.