Scraping Zhilian Zhaopin Through Its Ajax API
Last time I walked through the pitfalls of scraping the Zhilian Zhaopin site and ended up grabbing the job listings with Selenium browser automation. I have since found a simpler approach: analyze Zhilian's Ajax API and request the JSON data directly.
Analyzing the page's Ajax API: type python into the search box, click search, open the browser's developer tools, switch to the Network tab, and click XHR to filter out the Ajax requests.
Page 1:
This is the request behind the first page of search results. It carries a lot of parameters, and just looking at them it is hard to tell which ones matter to us, so let's click through to the next page and keep analyzing.
These are the parameters on page 2. Compared with page 1, there is one extra parameter, start, with a value of 60. Clicking through to page 3, we check the parameters again.
Here start has become 120. It is easy to see that start controls which record each page's data begins at, while pageSize is the number of records per page, fixed at 60. At this point it is clear that we can join the parameters with urlencode and control which pages we scrape through the start value we pass in, as the sketch below shows.
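To make the pagination concrete, here is a minimal sketch. The endpoint and parameter names are the ones captured in the XHR panel; the helper build_search_url is my own illustrative name, not something from the site:

import requests  # noqa: imported here only for symmetry with the full code below
from urllib.parse import urlencode

def build_search_url(page, page_size=60, keyword="python", city_id="489"):
    # start is the record offset: page 1 -> 0, page 2 -> 60, page 3 -> 120 ...
    params = {
        "start": str((page - 1) * page_size),
        "pageSize": str(page_size),
        "cityId": city_id,
        "kw": keyword,
        "kt": "3",
    }
    return "https://fe-api.zhaopin.com/c/i/sou?" + urlencode(params)

for page in range(1, 4):
    print(build_search_url(page))
    # ...start=0..., ...start=60..., ...start=120...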
Analyzing the Preview:
Looking at the Preview tab, the returned JSON has a data field, and data contains a results field. Expanding results shows that the job title, the salary, and the URL of the job detail page are all in there, so we can call response.json() to get the returned JSON and parse out the data we want.
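For instance, the parsing step is roughly the following. The field names (jobName, salary, positionURL) are the ones used in the full code further down; whether the API still responds to this exact URL is an assumption:

import requests

ajax_url = "https://fe-api.zhaopin.com/c/i/sou?start=0&pageSize=60&cityId=489&kw=python&kt=3"
response = requests.get(ajax_url, timeout=5)
result = response.json()  # parse the returned JSON body
for item in result["data"]["results"]:
    print(item.get("jobName"), item.get("salary"), item.get("positionURL"))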
The plan for the code:
1. Analyze the page's Ajax API and figure out which parameters have to be passed in.
2. Control which pages get scraped via the start parameter and generate the list of URLs to crawl.
3. Request those URLs and parse out the job detail page URLs.
4. Request each job detail page and parse out the detailed job information.
5. Save everything to the database.
Full code:
import time
from datetime import datetime
import requests
import lxml.html
from urllib.parse import urlencode, urlparse
import random
from retrying import retry
import sqlite3

db = sqlite3.connect("zhilian.db")
cursor = db.cursor()
# Create the Job table if it does not exist yet (schema inferred from the INSERT below)
cursor.execute(
    "CREATE TABLE IF NOT EXISTS Job"
    "(jobName TEXT, jobSalary TEXT, jobCity TEXT, jobEdu TEXT, jobCompany TEXT, jobDetail TEXT)"
)
class Throttle:
    """
    Download throttle: enforces a minimum delay between
    requests to the same domain.
    """
    def __init__(self, delay):
        self.domains = {}   # domain -> datetime of the last access
        self.delay = delay  # minimum delay (seconds) between requests

    def wait_url(self, url_str):
        domain_url = urlparse(url_str).netloc
        last_accessed = self.domains.get(domain_url)
        if self.delay > 0 and last_accessed is not None:
            sleep_interval = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_interval > 0:
                # add 1-3 s of random jitter so the requests look less mechanical
                time.sleep(sleep_interval + round(random.uniform(1, 3), 1))
        self.domains[domain_url] = datetime.now()
# Pool of User-Agent strings; one is picked at random for each detail-page request
User_Agent = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
]
# Headers observed on the Ajax request in the developer tools
headers = {
    "Host": "fe-api.zhaopin.com",
    "Origin": "https://sou.zhaopin.com",
    "Referer": "https://sou.zhaopin.com/?p=3&pageSize=60&jl=489&kw=python&kt=3",
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
}
throttle = Throttle(3)  # at least 3 s between requests to the same domain
def get_page(start, page):
    """
    Build the Ajax query parameters and request the Ajax API.
    :param start: paging offset used by the API (which record to start from)
    :param page: page number corresponding to that offset
    :return: parsed JSON response, or None on failure
    """
    params = {
        "start": str(start),
        "pageSize": "60",
        "cityId": "489",
        "workExperience": "-1",
        "education": "-1",
        "companyType": "-1",
        "employmentType": "-1",
        "jobWelfareTag": "-1",
        "kw": "python",
        "kt": "3",
        "lastUrlQuery": {"p": str(page), "pageSize": "60", "jl": "489", "kw": "python", "kt": "3"}
    }
    url = "https://fe-api.zhaopin.com/c/i/sou?" + urlencode(params)
    result_json = download_retry(url)
    return result_json
@retry(stop_max_attempt_number=4)
def download_retry(url):
    """
    The actual download function; retried up to 4 times on failure.
    :param url: URL to fetch
    :return: parsed JSON on a 200 response, otherwise None
    """
    try:
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            result = response.json()
        else:
            result = None
    except requests.RequestException:
        # re-raise so that @retry triggers another attempt
        raise ConnectionError
    return result
def get_content(result_json):
    """Pull the fields we care about out of the data.results list."""
    if result_json is not None:
        result_items = result_json.get("data").get("results")
        for item in result_items:
            job_title = item.get("jobName")
            job_city = item.get("city").get("display")
            job_salary = item.get("salary")
            job_edu = item.get("eduLevel").get("name")
            job_company = item.get("company").get("name")
            job_url = item.get("positionURL")
            yield {
                "job_title": job_title,
                "job_city": job_city,
                "job_salary": job_salary,
                "job_edu": job_edu,
                "job_company": job_company,
                "job_url": job_url,
            }
def download_detail(url):
    """Fetch a job detail page and extract the job description text."""
    throttle.wait_url(url)
    response = requests.get(url, headers={"User-Agent": random.choice(User_Agent)}, timeout=5)
    html = lxml.html.fromstring(response.text)
    detail1 = html.xpath('//div[@class="tab-inner-cont"]/*/text()')
    detail2 = html.xpath('//div[@class="tab-inner-cont"]/p/span/text()')
    # xpath() always returns a list, so test for non-empty rather than "is not None"
    if detail1:
        job_detail = detail1
    elif detail2:
        job_detail = detail2
    else:
        job_detail = "Not available"
    return job_detail
if __name__ == '__main__':
    for page in range(1, 100):
        # page 1 -> start=0, page 2 -> start=60, page 3 -> start=120 ...
        result = get_page((page - 1) * 60, page)
        for content in get_content(result):
            print("Downloading:", content["job_url"])
            job_detail = download_detail(content["job_url"])
            # collapse the extracted text fragments and strip all whitespace
            detail = ''.join(''.join(job_detail).split())
            content["job_detail"] = detail
            cursor.execute(
                "insert into Job(jobName,jobSalary,jobCity,jobEdu,jobCompany,jobDetail) values(?,?,?,?,?,?)",
                [content["job_title"], content["job_salary"], content["job_city"],
                 content["job_edu"], content["job_company"], content["job_detail"]])
            db.commit()
I added a download throttle here so that we do not hit the site too fast and get the IP banned.
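As a rough illustration of the throttle's behavior (the URL is purely hypothetical):

throttle = Throttle(3)
throttle.wait_url("https://jobs.zhaopin.com/example.htm")  # first request to the domain: no wait
throttle.wait_url("https://jobs.zhaopin.com/example.htm")  # same domain again: sleeps roughly 4-6 s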