Web scraping in Python 3 with the get method of the requests module (disguising the scraper as a browser)

There are two ways to fetch a web page:

One is the urlopen method of the urllib module (in Python 3 it lives in urllib.request):

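import urllib.request

response = urllib.request.urlopen("http://www.itcast.cn")

print(response.read())

response.read(): returns the page source as bytes.

response.getcode(): returns the HTTP status code. 200 means the request succeeded, 404 means the address was not found.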
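As a rough sketch of the urlopen approach, the helper below decodes the body to text and reports the status code. The name fetch and the 10-second timeout are my own choices for illustration, not part of urllib. Note that in Python 3 urlopen raises urllib.error.HTTPError for responses such as 404 instead of returning them, which is why the except branch is there.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, text) for url; text is None on an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset announced by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            text = response.read().decode(charset, errors="replace")
            return response.getcode(), text
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive here as exceptions carrying the status code.
        return err.code, None
```

Calling fetch("http://www.itcast.cn") should then give the status code together with the decoded page source.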

The other is the get method of the requests module:

import requests

res = requests.get(url)  # fetch the page; url is the address to scrape

result = res.text  # page source as a string (text is an attribute, not a method)

After that you can pull out the parts of the page you want with regular expressions.
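To make the regular-expression step concrete, here is a minimal sketch that collects links from a page source. The pattern, the name extract_links, and the sample HTML are assumptions made for illustration; a real page may need a different pattern, and for messy HTML a proper parser is usually more robust than a regex.

```python
import re

# Matches <a ... href="...">text</a> and captures the href and the link text.
LINK_RE = re.compile(
    r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (href, link_text) pairs found in html."""
    return [(href, text.strip()) for href, text in LINK_RE.findall(html)]


# Hypothetical page source standing in for res.text
sample = '<p><a href="/a.html">First</a> and <a class="x" href="/b.html"> Second </a></p>'
links = extract_links(sample)
```

Here links ends up as a list with one (href, text) pair per anchor tag in the sample.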

2 Disguising the scraper as a browser:

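Open the page you want to scrape, press F12 (or right-click the page and choose Inspect), click the Network tab and refresh the page. Click one of the captured requests and open Headers to see the request headers, which include your browser version (the User-Agent) and so on:

That is the whole process. You can also capture the traffic with a tool such as Fiddler, but then you have to configure a proxy server in the browser.

Pass the header information you captured to the get method as its headers argument: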

send_headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
              "Connection":"keep-alive",
              "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
              "Accept-Language":"zh-CN,zh;q=0.8"}

res=requests.get(FirstUrl+url,headers=send_headers)   # prepend the base address (FirstUrl) so the URL is complete
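The call above can be wrapped up roughly as follows. This is only a sketch: build_url and fetch_page are hypothetical helper names, the timeout value is arbitrary, and urljoin is swapped in for the plain string concatenation FirstUrl+url because it copes with missing or doubled slashes.

```python
from urllib.parse import urljoin

import requests

# Headers copied from the browser's Network panel, as described above.
SEND_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Connection": "keep-alive",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


def build_url(first_url, url):
    """Join the base address and a relative link into a complete URL."""
    return urljoin(first_url, url)


def fetch_page(first_url, url, headers=SEND_HEADERS, timeout=10):
    """Fetch a page with browser-like headers and return its source as text."""
    res = requests.get(build_url(first_url, url), headers=headers, timeout=timeout)
    res.raise_for_status()  # raise an exception on 4xx/5xx instead of parsing an error page
    return res.text
```

With a base address such as "http://www.itcast.cn/" and a relative link taken from the page, fetch_page returns the source of the linked page, or raises if the server answers with an error status.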

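This gets past simple anti-scraping checks that reject requests which do not look like they come from a browser. It may also reduce the chance of being blocked, although browser headers alone do not prevent a site from banning an IP that sends too many requests. I am writing this down as a record of my own learning, and hopefully it helps someone else along the way.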
